trimdown 0.1.1

File compression CLI tool for PowerPoint, PDF, Video, and Word documents
# Trimdown Architecture

## Overview
Trimdown is a Rust CLI tool with a modular architecture that follows the DRY, KISS, and SoC (separation of concerns) principles. The codebase is a single crate with clear module boundaries and minimal dependencies.

## Project Structure

```
trimdown-rs/
├── src/
│   ├── main.rs           # CLI entry point and orchestration
│   ├── lib.rs            # Public library interface
│   ├── cli.rs            # Command-line argument parsing
│   ├── compression.rs    # Core compression implementations
│   ├── formats.rs        # File type detection and validation
│   ├── processor.rs      # File and folder processing logic
│   └── utils.rs          # Utility functions and helpers
├── Cargo.toml            # Dependencies and build configuration
├── README.md             # User documentation
├── SPEC.md               # Technical specifications
├── ARCHITECTURE.md       # This file
├── TODO.md               # Task tracking
└── tests/                # Integration tests (future)
```

## Module Architecture

### 1. Main Module (`main.rs`)
**Responsibility**: Application entry point and high-level orchestration

**Key Functions**:
- Parse CLI arguments using clap
- Initialize logging with env_logger
- Route to single file or folder processing
- Display application header and basic info

**Dependencies**:
- `cli`: For argument parsing
- `processor`: For file/folder processing
- `colored`: For console output
- `tokio`: For async runtime

**Design Decisions**:
- Async main function to support parallel video compression
- Minimal logic - delegates to processor module
- Clear error messages for user guidance

### 2. CLI Module (`cli.rs`)
**Responsibility**: Command-line interface definition and parsing

**Key Components**:
- `Cli` struct: Main CLI configuration
- `PdfQuality` enum: PDF compression quality levels
- `PdfMethod` enum: PDF compression methods

**Design Decisions**:
- Uses clap derive macros for declarative CLI definition
- Sensible defaults for all optional parameters
- Type-safe enums for quality/method selection
- `Cli` implements `Clone` so the configuration can be passed to async tasks

**Configuration Options**:
```rust
pub struct Cli {
    pub input: PathBuf,           // Required input path
    pub output: Option<PathBuf>,  // Optional output path
    pub folder: bool,             // Force folder mode
    pub quality: u8,              // JPEG quality (1-100)
    pub max_width: u32,           // Max image width
    pub video_crf: u8,            // Video CRF (0-51)
    pub pdf_quality: PdfQuality,  // PDF quality level
    pub pdf_method: PdfMethod,    // PDF compression method
    pub force: bool,              // Overwrite existing files
    pub verbose: bool,            // Verbose output
}
```
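
The range comments on `quality` and `video_crf` imply a validation step. The following is a minimal, dependency-free sketch of such a check; `validate_ranges` is a hypothetical helper, and the real crate may enforce these ranges through clap instead.

```rust
/// Hypothetical helper: checks the numeric ranges documented on `Cli`.
/// Returns a description of the first out-of-range value, if any.
fn validate_ranges(quality: u8, video_crf: u8) -> Result<(), String> {
    // JPEG quality is documented as 1-100.
    if !(1..=100).contains(&quality) {
        return Err(format!("quality must be between 1 and 100, got {quality}"));
    }
    // Video CRF is documented as 0-51.
    if video_crf > 51 {
        return Err(format!("video_crf must be between 0 and 51, got {video_crf}"));
    }
    Ok(())
}
```

Returning a `Result` keeps the check usable from both the CLI entry point and library callers.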

### 3. Formats Module (`formats.rs`)
**Responsibility**: File type detection and validation

**Key Components**:
- `FileType` enum: Supported file types
- `detect_file_type()`: Extension-based detection
- `is_supported_file()`: Quick validation

**Design Decisions**:
- Simple extension-based detection (sufficient for CLI use)
- Comprehensive test coverage (100%)
- Case-insensitive extension matching
- Clear separation from compression logic

**Supported Types**:
- PowerPoint: .pptx, .ppt
- PDF: .pdf
- Video: .mp4, .avi, .mov, .wmv, .mkv, .m4v, .flv, .webm
- Word: .docx, .doc
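
As an illustration of case-insensitive, extension-based detection, here is a minimal sketch using only the standard library. The `Option<FileType>` return type and the variant names are assumptions; the actual signatures in `formats.rs` may differ.

```rust
use std::path::Path;

/// Supported file categories, mirroring the list above (variant names assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    PowerPoint,
    Pdf,
    Video,
    Word,
}

/// Extension-based, case-insensitive detection.
/// Returns `None` for unsupported or missing extensions.
pub fn detect_file_type(path: &Path) -> Option<FileType> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pptx" | "ppt" => Some(FileType::PowerPoint),
        "pdf" => Some(FileType::Pdf),
        "mp4" | "avi" | "mov" | "wmv" | "mkv" | "m4v" | "flv" | "webm" => Some(FileType::Video),
        "docx" | "doc" => Some(FileType::Word),
        _ => None,
    }
}

/// Quick validation built on top of detection.
pub fn is_supported_file(path: &Path) -> bool {
    detect_file_type(path).is_some()
}
```

Lowercasing the extension once keeps the `match` arms simple and makes `Deck.PPTX` and `deck.pptx` behave identically.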

### 4. Compression Module (`compression.rs`)
**Responsibility**: Core compression implementations for all file types

**Key Functions**:
- `compress_powerpoint()`: PowerPoint compression with media optimization
- `compress_pdf()`: PDF compression using QPDF
- `compress_video()`: Video compression using FFmpeg
- `compress_word()`: Word document compression
- `compress_image_file()`: Image optimization
- `compress_video_file()`: Video file optimization
- `detect_and_fix_mislabeled_image()`: Smart image format detection

**Design Patterns**:
- **Strategy Pattern**: Different compression strategies per file type
- **Template Method**: Common extraction/compression/repacking workflow
- **Async/Await**: Parallel video compression in PowerPoint files

**Compression Workflows**:

#### PowerPoint Compression
```
1. Extract PPTX ZIP archive → temp directory
2. Scan ppt/media/ for images and videos
3. Compress images sequentially (fast)
4. Compress videos in parallel (slow, benefits from parallelism)
5. Repack ZIP with deflate compression
6. Validate output file
```

#### PDF Compression
```
1. Check for QPDF availability
2. Execute QPDF with optimization flags:
   - Linearization for fast web viewing
   - Stream compression
   - Flate recompression
   - Image optimization
3. Validate PDF header and structure
4. Handle exit codes (0=success, 3=warnings)
```
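
Steps 3 and 4 of the PDF workflow can be sketched as two small pure functions. This is an illustration under assumptions: `QpdfOutcome`, `classify_qpdf_exit`, and `has_pdf_header` are hypothetical names, and the header check only looks at the leading `%PDF-` marker, not the full structure.

```rust
/// Outcome of a QPDF run, derived from its process exit code.
#[derive(Debug, PartialEq, Eq)]
enum QpdfOutcome {
    Success,
    SuccessWithWarnings,
    Failed(Option<i32>),
}

/// Maps an exit code to an outcome: 0 = success, 3 = warnings.
/// Anything else, including termination by signal (`None`), is a failure.
fn classify_qpdf_exit(code: Option<i32>) -> QpdfOutcome {
    match code {
        Some(0) => QpdfOutcome::Success,
        Some(3) => QpdfOutcome::SuccessWithWarnings,
        other => QpdfOutcome::Failed(other),
    }
}

/// Cheap header check: a PDF file starts with the bytes `%PDF-`.
fn has_pdf_header(bytes: &[u8]) -> bool {
    bytes.starts_with(b"%PDF-")
}
```

The `Option<i32>` input matches what `std::process::ExitStatus::code()` returns, so the classifier can be fed directly from a finished child process.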

#### Video Compression
```
1. Check for FFmpeg availability
2. Probe video duration for progress estimation
3. Execute FFmpeg with H.264 encoding:
   - CRF-based quality control
   - AAC audio at 128kbps
   - Fast start for streaming
4. Monitor progress via temporary file
5. Validate output and compression ratio
6. Replace original only if significant compression
```
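
The encoding settings in step 3 and the replacement rule in step 6 might look roughly like the sketch below. The function names and the `min_saving_percent` threshold are hypothetical; the document does not state what counts as "significant" compression, and the real argument list may include further options.

```rust
use std::ffi::OsString;
use std::path::Path;

/// Builds an FFmpeg argument list for H.264 video with CRF quality control,
/// AAC audio at 128 kbps, and the fast-start flag for streaming.
fn ffmpeg_args(input: &Path, output: &Path, crf: u8) -> Vec<OsString> {
    vec![
        OsString::from("-i"),
        OsString::from(input),
        OsString::from("-c:v"),
        OsString::from("libx264"),
        OsString::from("-crf"),
        OsString::from(crf.to_string()),
        OsString::from("-c:a"),
        OsString::from("aac"),
        OsString::from("-b:a"),
        OsString::from("128k"),
        OsString::from("-movflags"),
        OsString::from("+faststart"),
        OsString::from(output),
    ]
}

/// Hypothetical rule for step 6: replace the original only when the
/// compressed file saves at least `min_saving_percent` of the original size.
fn worth_replacing(original_bytes: u64, compressed_bytes: u64, min_saving_percent: u64) -> bool {
    if original_bytes == 0 || compressed_bytes >= original_bytes {
        return false;
    }
    let saved = original_bytes - compressed_bytes;
    saved * 100 >= original_bytes * min_saving_percent
}
```

Keeping the arguments as separate `OsString` values means they can be handed to `std::process::Command::args` without any shell quoting.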

#### Word Compression
```
1. Extract DOCX ZIP archive → temp directory
2. Scan word/media/ for images
3. Compress images with configured quality
4. Repack ZIP with deflate compression
5. Validate output file
```

**Image Compression Strategy**:
- Skip files < 150KB (already small)
- Detect mislabeled files (e.g., JPEG with .png extension)
- Resize if width > max_width using Lanczos3 filter
- Format-specific optimization:
  - JPEG: Progressive encoding with quality setting
  - PNG: Lossless recompression
  - BMP/TIFF: Convert to WebP or JPEG
  - GIF: Preserve animations, resize only
  - WebP/AVIF: Format-specific optimization
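
Two parts of this strategy lend themselves to a short illustration: detecting the real format from magic bytes, and computing the resized dimensions. The sketch below uses only the standard library; all names are hypothetical, the 150 KB threshold is assumed to mean 150 × 1024 bytes, and the actual resampling (Lanczos3) would be done by the `image` crate.

```rust
/// Image formats recognizable from their leading bytes (subset, for illustration).
#[derive(Debug, PartialEq, Eq)]
enum ImageKind {
    Jpeg,
    Png,
    Gif,
    Bmp,
    WebP,
}

/// Sniffs the format from magic bytes, ignoring the file extension.
fn sniff_image_kind(bytes: &[u8]) -> Option<ImageKind> {
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(ImageKind::Jpeg)
    } else if bytes.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Some(ImageKind::Png)
    } else if bytes.starts_with(b"GIF87a") || bytes.starts_with(b"GIF89a") {
        Some(ImageKind::Gif)
    } else if bytes.starts_with(b"BM") {
        Some(ImageKind::Bmp)
    } else if bytes.len() >= 12 && &bytes[0..4] == b"RIFF" && &bytes[8..12] == b"WEBP" {
        Some(ImageKind::WebP)
    } else {
        None
    }
}

/// Assumed reading of "150KB": files below this size are skipped.
const MIN_SIZE_BYTES: u64 = 150 * 1024;

fn should_skip(file_size: u64) -> bool {
    file_size < MIN_SIZE_BYTES
}

/// Returns the new (width, height) when the image is wider than `max_width`,
/// preserving the aspect ratio, or `None` when no resize is needed.
fn target_dimensions(width: u32, height: u32, max_width: u32) -> Option<(u32, u32)> {
    if max_width == 0 || width <= max_width {
        return None;
    }
    let new_height = (height as u64 * max_width as u64 / width as u64).max(1) as u32;
    Some((max_width, new_height))
}
```

A file is mislabeled when the sniffed kind disagrees with the kind implied by its extension, for example a `.png` file whose bytes start with the JPEG marker.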

**Error Handling**:
- Graceful degradation when external tools unavailable
- Clear error messages with installation instructions
- Validation of output files before replacing originals
- Atomic file operations to prevent data loss

### 5. Processor Module (`processor.rs`)
**Responsibility**: File and folder processing orchestration

**Key Functions**:
- `process_single_file()`: Single file compression workflow
- `process_folder()`: Batch folder compression workflow

**Design Decisions**:
- Clear separation between single and batch processing
- Comprehensive statistics and reporting
- Force flag validation before processing
- Progress tracking for batch operations

**Single File Workflow**:
```
1. Validate input file exists
2. Generate output path (or use provided)
3. Check for existing output (respect --force flag)
4. Detect file type
5. Execute appropriate compression function
6. Display compression statistics
```
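
Steps 1 to 3 of the single file workflow can be sketched as follows. The helper names are hypothetical, and using the `_compressed` suffix as the default for single files is an assumption carried over from the folder workflow.

```rust
use std::path::{Path, PathBuf};

/// Derives `name_compressed.ext` next to the input file.
fn default_output_path(input: &Path) -> PathBuf {
    let stem = input.file_stem().and_then(|s| s.to_str()).unwrap_or("output");
    let name = match input.extension().and_then(|e| e.to_str()) {
        Some(ext) => format!("{stem}_compressed.{ext}"),
        None => format!("{stem}_compressed"),
    };
    input.with_file_name(name)
}

/// Validates the input, picks the output path, and respects the force flag.
fn resolve_output(input: &Path, requested: Option<&Path>, force: bool) -> Result<PathBuf, String> {
    if !input.is_file() {
        return Err(format!("input file not found: {}", input.display()));
    }
    let output = requested
        .map(Path::to_path_buf)
        .unwrap_or_else(|| default_output_path(input));
    if output.exists() && !force {
        return Err(format!(
            "output already exists: {} (use --force to overwrite)",
            output.display()
        ));
    }
    Ok(output)
}
```

Resolving the output path before any compression starts means a conflict is reported before time is spent on encoding.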

**Folder Workflow**:
```
1. Scan folder for supported files (max depth 1)
2. Display file list with sizes and types
3. Process each file sequentially:
   - Generate output path with _compressed suffix
   - Check for existing output (respect --force flag)
   - Compress file
   - Track statistics
4. Display summary:
   - Files processed
   - Total original size
   - Total compressed size
   - Overall compression ratio
```
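
The summary in step 4 needs only a small accumulator. The sketch below is illustrative: `BatchStats` is a hypothetical type, and "overall compression ratio" is interpreted here as the percentage of bytes saved, which may not match the actual report format.

```rust
/// Running totals for a batch run (hypothetical type).
#[derive(Debug, Default)]
struct BatchStats {
    files_processed: usize,
    total_original: u64,
    total_compressed: u64,
}

impl BatchStats {
    /// Records the sizes of one processed file.
    fn record(&mut self, original: u64, compressed: u64) {
        self.files_processed += 1;
        self.total_original += original;
        self.total_compressed += compressed;
    }

    /// Percentage of bytes saved across the whole batch (0.0 for an empty batch).
    fn reduction_percent(&self) -> f64 {
        if self.total_original == 0 {
            return 0.0;
        }
        let saved = self.total_original.saturating_sub(self.total_compressed);
        saved as f64 * 100.0 / self.total_original as f64
    }
}
```

Computing the ratio from the totals, instead of averaging per-file ratios, weights large files appropriately.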

### 6. Utils Module (`utils.rs`)
**Responsibility**: Shared utility functions

**Key Functions**:
- `check_external_tool()`: Verify external tool availability
- `print_warning()`: Colored warning messages
- `print_success()`: Colored success messages
- `print_error()`: Colored error messages
- `print_info()`: Colored info messages
- `format_size()`: Human-readable file sizes

**Design Decisions**:
- Simple, focused functions
- No business logic
- 100% test coverage
- Consistent output formatting
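
For illustration, a human-readable size formatter in the spirit of `format_size()` could look like this. The binary (1024-based) units and one-decimal output are assumptions; the actual function may format differently.

```rust
/// Formats a byte count using 1024-based units with one decimal place.
fn format_size(bytes: u64) -> String {
    const UNITS: [&str; 5] = ["B", "KB", "MB", "GB", "TB"];
    if bytes < 1024 {
        return format!("{bytes} B");
    }
    let mut value = bytes as f64;
    let mut unit = 0;
    // Divide down until the value fits the current unit or units run out.
    while value >= 1024.0 && unit < UNITS.len() - 1 {
        value /= 1024.0;
        unit += 1;
    }
    format!("{value:.1} {}", UNITS[unit])
}
```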

## Data Flow

### Single File Compression
```
User Input (CLI)
    ↓
main.rs: Parse arguments
    ↓
processor.rs: process_single_file()
    ↓
formats.rs: detect_file_type()
    ↓
compression.rs: compress_*()
    ↓
utils.rs: Display results
    ↓
User Output (Console)
```

### Batch Folder Compression
```
User Input (CLI)
    ↓
main.rs: Parse arguments
    ↓
processor.rs: process_folder()
    ↓
walkdir: Scan directory
    ↓
formats.rs: Filter supported files
    ↓
Loop: For each file
    ↓
compression.rs: compress_*()
    ↓
utils.rs: Display progress
    ↓
processor.rs: Aggregate statistics
    ↓
User Output (Console)
```

## Concurrency Model

### Async Runtime
- **Runtime**: Tokio multi-threaded runtime
- **Purpose**: Enable parallel video compression within PowerPoint files
- **Scope**: Limited to video processing tasks

### Parallelism Strategy
- **Images**: Sequential processing (fast, I/O bound)
- **Videos**: Parallel processing (slow, CPU bound)
- **Files**: Sequential processing (simplicity, clear progress)

### Synchronization
- **No shared state**: Each task operates on separate files
- **No locks needed**: Tasks never touch the same file, so there is nothing to guard with a lock
- **Progress tracking**: Clone progress bar for async tasks
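
The "no shared state" model can be illustrated without an async runtime. The sketch below deliberately swaps Tokio tasks for `std::thread::scope` so that it has no dependencies; `run_in_parallel` is a hypothetical helper, and the real implementation uses Tokio as described above.

```rust
use std::thread;

/// Runs `work` on every item in its own scoped thread and returns the
/// results in input order. No locks are needed because each worker only
/// reads its own item and produces its own result.
fn run_in_parallel<T, R, F>(items: &[T], work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let work = &work;
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = items
            .iter()
            .map(|item| scope.spawn(move || work(item)))
            .collect();
        // Then collect the results in the original order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

In the PowerPoint case, each item would be the path of one embedded video and `work` would invoke the video compression for that path.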

## Error Handling Strategy

### Error Types
1. **User Errors**: Invalid input, missing files, permission issues
2. **System Errors**: Missing external tools, disk space, I/O errors
3. **Format Errors**: Corrupted files, unsupported formats
4. **Compression Errors**: Tool failures, validation failures

### Error Handling Approach
- **anyhow**: For error propagation and context
- **thiserror**: For custom error types (future)
- **Result<T>**: All fallible operations return Result
- **Graceful degradation**: Skip problematic files in batch mode

### Error Messages
- **Clear**: Describe what went wrong
- **Actionable**: Provide resolution steps
- **Colored**: Red for errors, yellow for warnings
- **Contextual**: Include file names and paths
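
Graceful degradation in batch mode amounts to collecting failures instead of stopping at the first one. The following is a sketch under assumptions: `compress_all` is a hypothetical function, and plain `String` errors stand in for `anyhow::Error` to keep the example dependency-free.

```rust
use std::path::{Path, PathBuf};

/// Applies `compress` to every file, skipping over failures.
/// Returns the files that succeeded and, for each failure, the file
/// together with a contextual message that includes its path.
fn compress_all<F>(files: &[PathBuf], mut compress: F) -> (Vec<PathBuf>, Vec<(PathBuf, String)>)
where
    F: FnMut(&Path) -> Result<(), String>,
{
    let mut succeeded = Vec::new();
    let mut failed = Vec::new();
    for file in files {
        match compress(file.as_path()) {
            Ok(()) => succeeded.push(file.clone()),
            Err(reason) => {
                let message = format!("{}: {reason}", file.display());
                failed.push((file.clone(), message));
            }
        }
    }
    (succeeded, failed)
}
```

The caller can then print the failure messages in red after the summary, while the successful files still count toward the batch statistics.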

## Dependencies

### Core Dependencies
- **clap**: CLI argument parsing (derive macros)
- **tokio**: Async runtime for parallel processing
- **anyhow**: Error handling and context
- **colored**: Terminal color output

### Compression Dependencies
- **zip**: ZIP archive handling (PPTX, DOCX)
- **flate2**: Deflate compression
- **image**: Image processing and format conversion
- **lopdf**: PDF manipulation (native fallback)

### UI Dependencies
- **indicatif**: Progress bars
- **console**: Terminal utilities

### Utility Dependencies
- **walkdir**: Directory traversal
- **tempfile**: Temporary file/directory management
- **serde**: Serialization (future config files)
- **log/env_logger**: Logging infrastructure

### External Tools
- **FFmpeg**: Video compression (optional)
- **QPDF**: PDF compression (optional)

## Testing Strategy

### Unit Tests
- **Location**: Inline with modules (`#[cfg(test)]`)
- **Coverage**: Individual functions and edge cases
- **Focus**: formats.rs (100% coverage)

### Integration Tests
- **Location**: `tests/` directory (future)
- **Coverage**: End-to-end compression workflows
- **Focus**: Real file processing with validation

### Test Data
- **Sample Files**: Small test files for each format
- **Corrupted Files**: Invalid/corrupted files for error handling
- **Edge Cases**: Empty files, large files, special characters

### Test Utilities
- **tempfile**: Temporary test files and directories
- **assert_cmd**: CLI testing (future)

## Performance Considerations

### Memory Management
- **Streaming**: Process files in chunks where possible
- **Temporary Files**: Use temp directories for extraction
- **Cleanup**: Automatic cleanup via TempDir RAII

### CPU Utilization
- **Parallel Videos**: Utilize all CPU cores for video compression
- **Sequential Images**: Fast enough without parallelism
- **Batch Processing**: Sequential to avoid resource contention

### I/O Optimization
- **Buffered I/O**: Use buffered readers/writers
- **Minimal Copies**: Avoid unnecessary file copies
- **In-Place Updates**: Compress directly when safe

### Progress Reporting
- **Granular Updates**: Update progress every 5 images or 500ms
- **Estimated Duration**: Use video duration for progress calculation
- **Non-Blocking**: Progress monitoring in separate task
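
The "every 5 images or 500ms" rule can be expressed as a small throttle. This is a sketch only: `ProgressThrottle` is a hypothetical type, and the current time is passed in explicitly so the logic is easy to test.

```rust
use std::time::{Duration, Instant};

/// Decides when the progress display should be refreshed.
struct ProgressThrottle {
    every_n: u32,
    min_interval: Duration,
    pending: u32,
    last_update: Instant,
}

impl ProgressThrottle {
    fn new(every_n: u32, min_interval: Duration, now: Instant) -> Self {
        Self { every_n, min_interval, pending: 0, last_update: now }
    }

    /// Records one completed item. Returns `true` when either `every_n`
    /// items have accumulated or `min_interval` has passed since the
    /// last refresh, and resets both counters in that case.
    fn tick(&mut self, now: Instant) -> bool {
        self.pending += 1;
        let due = self.pending >= self.every_n
            || now.duration_since(self.last_update) >= self.min_interval;
        if due {
            self.pending = 0;
            self.last_update = now;
        }
        due
    }
}
```

With `ProgressThrottle::new(5, Duration::from_millis(500), Instant::now())`, a burst of small images triggers a refresh on every fifth item, while a single slow image still triggers one once 500 ms have elapsed.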

## Security Considerations

### Input Validation
- **Path Traversal**: Validate ZIP entries don't escape extraction directory
- **File Size**: Check available disk space before processing
- **Format Validation**: Verify file headers and structure
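
A path traversal check for ZIP entries can be written with `std::path::Component`. The sketch below is one possible approach, not necessarily what the crate does; `safe_extract_path` is a hypothetical helper.

```rust
use std::path::{Component, Path, PathBuf};

/// Joins a ZIP entry name onto the extraction directory, rejecting any
/// entry that could resolve outside of it.
fn safe_extract_path(dest: &Path, entry_name: &str) -> Option<PathBuf> {
    let mut out = dest.to_path_buf();
    let mut pushed = false;
    for component in Path::new(entry_name).components() {
        match component {
            Component::Normal(part) => {
                out.push(part);
                pushed = true;
            }
            // `.` does not change the location.
            Component::CurDir => {}
            // `..`, a root, or a Windows prefix could point outside `dest`.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    // Reject empty names so the result is never `dest` itself.
    pushed.then_some(out)
}
```

Rejecting every `..` component is stricter than necessary (an entry such as `a/../b` stays inside the directory), but it avoids having to reason about normalization.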

### External Tools
- **Command Injection**: Use structured arguments, not shell strings
- **Tool Verification**: Check tool availability before execution
- **Output Validation**: Verify compressed files are valid
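
Structured arguments and tool verification both fall out of `std::process::Command`, which passes each argument to the child process directly instead of through a shell. A minimal sketch, assuming a hypothetical `tool_available` helper (the real `check_external_tool()` may behave differently):

```rust
use std::process::{Command, Stdio};

/// Returns `true` if `program` can be started at all.
/// `version_flag` is passed as a single argument, never interpolated
/// into a shell string, so file names and flags cannot inject commands.
fn tool_available(program: &str, version_flag: &str) -> bool {
    Command::new(program)
        .arg(version_flag)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .is_ok()
}
```

For example, `tool_available("ffmpeg", "-version")` and `tool_available("qpdf", "--version")` would probe the two optional tools; the function only reports whether the process could be spawned, not whether it exited successfully.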

### File Operations
- **Atomic Writes**: Use temp files and rename for atomicity
- **Permission Checks**: Verify write permissions before processing
- **Cleanup**: Always clean up temporary files
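
The temp-file-and-rename pattern for atomic writes can be sketched with the standard library alone. `write_atomically` and the `.tmp` suffix are hypothetical, and the rename is only atomic when the temporary file lives on the same file system as the destination, which is why it is created next to it.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Writes `contents` to a sibling temp file, then renames it over `dest`.
/// Readers see either the old file or the complete new one.
fn write_atomically(dest: &Path, contents: &[u8]) -> io::Result<()> {
    let mut tmp_name = dest
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "destination has no file name"))?
        .to_os_string();
    tmp_name.push(".tmp");
    let tmp_path = dest.with_file_name(tmp_name);

    // Write and flush the temp file; it is closed when the closure returns.
    let write_result = fs::File::create(&tmp_path).and_then(|mut file| {
        file.write_all(contents)?;
        file.sync_all()
    });
    let result = write_result.and_then(|()| fs::rename(&tmp_path, dest));
    if result.is_err() {
        // Best-effort cleanup so a failed write leaves no debris behind.
        let _ = fs::remove_file(&tmp_path);
    }
    result
}
```

If the process is interrupted mid-write, only the temporary file is affected and the original at `dest` stays intact.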

## Extensibility

### Adding New File Types
1. Add variant to `FileType` enum in `formats.rs`
2. Update `detect_file_type()` with new extensions
3. Implement `compress_<type>()` in `compression.rs`
4. Add case in `process_single_file()` and `process_folder()`
5. Add tests for new format

### Adding Compression Options
1. Add field to `Cli` struct in `cli.rs`
2. Update compression functions to use new option
3. Update documentation and help text
4. Add tests for new option

### Adding External Tools
1. Add tool check in relevant compression function
2. Implement fallback strategy if tool unavailable
3. Add installation instructions to error messages
4. Update documentation with new dependency

## Build and Release

### Build Configuration
- **Edition**: Rust 2024
- **Optimization**: Size-optimized release builds (`opt-level = "z"`)
- **LTO**: Enabled for smaller binaries
- **Strip**: Debug symbols removed in release

### Release Process
1. Update version in `Cargo.toml`
2. Update `CHANGELOG.md`
3. Run full test suite: `cargo test`
4. Build release binary: `cargo build --release`
5. Test binary manually
6. Create a git tag matching the new version: `git tag vX.Y.Z` (for example, `git tag v0.1.1`)
7. Push the tag: `git push origin vX.Y.Z`
8. Update Homebrew formula
9. Publish to crates.io (future)

### Distribution
- **Homebrew**: Primary distribution method for macOS
- **Cargo**: `cargo install trimdown` (future)
- **GitHub Releases**: Binary releases for all platforms (future)

## Future Architecture Improvements

### Planned Enhancements
1. **Plugin System**: Allow custom compression strategies
2. **Configuration Files**: Support .trimdownrc for defaults
3. **Trait-Based Design**: Define Compressor trait for extensibility
4. **Streaming API**: Process files without full extraction
5. **Web API**: HTTP API for remote compression
6. **GUI**: Desktop application with drag-and-drop

### Refactoring Opportunities
1. **Error Types**: Custom error types with thiserror
2. **Compression Trait**: Abstract compression interface
3. **Progress Trait**: Pluggable progress reporting
4. **Config Module**: Centralized configuration management
5. **Metrics Module**: Detailed performance metrics

## Maintenance Guidelines

### Code Style
- **Formatting**: Use `rustfmt` with default settings
- **Linting**: Use `clippy` with default lints
- **Documentation**: Document all public APIs
- **Comments**: Explain why, not what

### Dependency Management
- **Updates**: Review and update dependencies quarterly
- **Security**: Monitor for security advisories
- **Minimal**: Only add dependencies when necessary
- **Versions**: Use specific versions, not wildcards

### Testing Requirements
- **Coverage**: Maintain >80% test coverage
- **CI**: Run tests on all commits
- **Benchmarks**: Track performance regressions
- **Integration**: Test with real files regularly

### Documentation
- **README**: User-facing documentation
- **SPEC**: Technical specifications
- **ARCHITECTURE**: This document
- **TODO**: Task tracking and roadmap
- **CHANGELOG**: Version history and changes