code-digest 0.5.0

# Architecture Overview

This document describes the high-level architecture of `code-digest`, a high-performance CLI tool that converts codebases to Markdown for use as LLM context.

## Design Principles

1. **Performance First**: Leverage Rust's performance characteristics and parallel processing
2. **Modularity**: Clear separation of concerns with focused modules
3. **Extensibility**: Easy to add new features without breaking existing functionality
4. **User Experience**: Clear error messages and an intuitive CLI
5. **LLM Optimization**: Output specifically designed for LLM consumption

## System Architecture

```mermaid
graph TD
    A[CLI Entry Point] --> B[Configuration Parser]
    B --> C[Directory Walker]
    C --> D[File Filter]
    D --> E[File Reader]
    E --> F[Token Counter]
    F --> G[File Prioritizer]
    G --> H[Markdown Generator]
    H --> I{Output Handler}
    I --> J[File Writer]
    I --> K[Stdout]
    I --> L[Gemini CLI]
```

## Core Components

### 1. CLI Interface (`src/cli.rs`)

**Responsibilities:**
- Parse command-line arguments using `clap`
- Load and merge configuration from files
- Validate user input
- Provide help and usage information

**Key Design Decisions:**
- Use derive macros for clean argument definition
- Support both positional arguments and flags
- Allow configuration via file for complex setups

### 2. Directory Walker (`src/core/walker.rs`)

**Responsibilities:**
- Traverse directory structures efficiently
- Apply ignore patterns from `.gitignore` and `.digestignore`
- Handle symbolic links safely
- Detect and skip binary files

**Key Design Decisions:**
- Use `ignore` crate for Git-aware traversal
- Parallel directory scanning with `rayon`
- Stream files rather than collecting all at once
- Skip binary files by detecting null bytes
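
The null-byte heuristic can be sketched in a few lines. This is an illustration only, not the crate's actual code: the function name `looks_binary` and the 8 KiB sniff window `SNIFF_LEN` are assumptions made for the example.

```rust
/// Number of leading bytes to inspect (assumed value for this sketch).
const SNIFF_LEN: usize = 8 * 1024;

/// Returns true if a NUL byte appears within the first `SNIFF_LEN` bytes.
/// Hypothetical helper: ASCII and UTF-8 source files normally contain no NUL.
pub fn looks_binary(bytes: &[u8]) -> bool {
    let end = bytes.len().min(SNIFF_LEN);
    bytes[..end].contains(&0)
}
```

Inspecting only a prefix keeps the check cheap for large files. One known trade-off of this heuristic is that UTF-16 text, which contains NUL bytes for ASCII characters, would be classified as binary.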

### 3. Token Counter (`src/core/token.rs`)

**Responsibilities:**
- Count tokens accurately using `tiktoken-rs`
- Cache token counts for performance
- Support different tokenizer models
- Handle large files efficiently

**Key Design Decisions:**
- Default to the `cl100k_base` encoding (used by GPT-3.5 and GPT-4)
- Lazy loading of tokenizer data
- Chunked processing for large files
- Thread-safe caching with `DashMap`
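
A thread-safe count cache might look roughly like the sketch below. To stay dependency-free it uses `Mutex<HashMap>` as a stand-in for `DashMap`, and the tokenizer is passed in as a closure rather than calling tiktoken. The type `TokenCache` and its methods are hypothetical names, not the crate's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Token-count cache keyed by a 64-bit hash of the file content.
/// `Mutex<HashMap>` is a std-only stand-in for a concurrent map.
pub struct TokenCache {
    counts: Mutex<HashMap<u64, usize>>,
}

impl TokenCache {
    pub fn new() -> Self {
        Self { counts: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached count, or computes and stores it via `count_fn`.
    /// `count_fn` is a placeholder for the real tokenizer call.
    pub fn get_or_count(&self, content: &str, count_fn: impl Fn(&str) -> usize) -> usize {
        let mut hasher = DefaultHasher::new();
        content.hash(&mut hasher);
        let key = hasher.finish();

        // The lock is held while counting, which serializes cache misses;
        // a sharded concurrent map avoids that bottleneck.
        let mut counts = self.counts.lock().unwrap();
        *counts.entry(key).or_insert_with(|| count_fn(content))
    }

    /// Number of distinct contents cached so far.
    pub fn len(&self) -> usize {
        self.counts.lock().unwrap().len()
    }
}
```

Keying by content hash rather than by path means identical files share one entry. A 64-bit hash can collide in principle, so a production cache may want to store the content length alongside the count as a cheap sanity check.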

### 4. File Prioritizer (`src/core/prioritizer.rs`)

**Responsibilities:**
- Score files based on importance
- Apply user-defined priority rules
- Ensure essential files are included first
- Handle token budget allocation

**Key Design Decisions:**
- Glob pattern matching for flexibility
- Weight-based scoring system
- Two-pass algorithm: collect then prioritize
- Support for `.digestkeep` essential files
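
The collect-then-prioritize approach can be illustrated as follows. The sketch makes several assumptions: rules match on a plain path suffix instead of a glob, the highest matching weight wins, unmatched files score 0, and selection is greedy against the token budget. `Candidate`, `Rule`, `score`, and `select` are hypothetical names, and `.digestkeep` handling is left out.

```rust
/// A discovered file with its token count (fields are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub path: String,
    pub tokens: usize,
}

/// A priority rule. A real rule would hold a glob pattern; this sketch
/// matches on a path suffix to avoid external crates.
pub struct Rule {
    pub suffix: &'static str,
    pub weight: i32,
}

/// Highest weight among matching rules, or 0 when nothing matches.
pub fn score(path: &str, rules: &[Rule]) -> i32 {
    rules
        .iter()
        .filter(|r| path.ends_with(r.suffix))
        .map(|r| r.weight)
        .max()
        .unwrap_or(0)
}

/// Second pass: sort the collected candidates by descending score, then
/// take files greedily while they still fit in the token budget.
pub fn select(mut files: Vec<Candidate>, rules: &[Rule], budget: usize) -> Vec<Candidate> {
    // Stable sort: files with equal scores keep their discovery order.
    files.sort_by_key(|f| std::cmp::Reverse(score(&f.path, rules)));
    let mut remaining = budget;
    let mut chosen = Vec::new();
    for f in files {
        if f.tokens <= remaining {
            remaining -= f.tokens;
            chosen.push(f);
        }
    }
    chosen
}
```

With a budget of 80 tokens, a 50-token `.rs` file weighted 100 is taken first, a 40-token `.md` file no longer fits, and a 30-token unmatched file fills the remaining space. Skipping files that do not fit, rather than stopping at the first one, is a design choice that trades strict priority order for better budget use.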

### 5. Markdown Generator (`src/core/digest.rs`)

**Responsibilities:**
- Format files into structured Markdown
- Add appropriate code fence languages
- Generate project structure overview
- Handle special file types

**Key Design Decisions:**
- Include file tree at the beginning
- Use consistent formatting throughout
- Preserve file paths for context
- Support syntax highlighting hints
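
As a rough illustration of per-file formatting, the sketch below renders one file as a heading carrying its path, followed by a fenced block tagged with a language hint. The heading level, the extension table, and the names `fence_language` and `render_file` are assumptions for the example; the real output layout may differ.

```rust
use std::path::Path;

/// Maps a file extension to a code-fence language tag.
/// The table is a small illustrative subset.
pub fn fence_language(path: &str) -> &'static str {
    match Path::new(path).extension().and_then(|e| e.to_str()) {
        Some("rs") => "rust",
        Some("py") => "python",
        Some("toml") => "toml",
        Some("md") => "markdown",
        _ => "", // unknown or missing extension: plain fence
    }
}

/// Renders one file as a path heading followed by a fenced code block.
pub fn render_file(path: &str, content: &str) -> String {
    let fence = "`".repeat(3);
    let mut out = String::new();
    out.push_str(&format!("## {path}\n\n"));
    out.push_str(&format!("{fence}{}\n", fence_language(path)));
    out.push_str(content);
    // Make sure the closing fence starts on its own line.
    if !content.ends_with('\n') {
        out.push('\n');
    }
    out.push_str(&fence);
    out.push('\n');
    out
}
```

A file whose content itself contains a three-backtick fence would terminate the block early under this scheme, so a fuller implementation would need to pick a longer fence for such files.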

## Data Flow

1. **Input Phase**
   - CLI arguments parsed
   - Configuration loaded and merged
   - Target directory validated

2. **Discovery Phase**
   - Directory tree traversed
   - Ignore patterns applied
   - Binary files filtered out
   - File metadata collected

3. **Analysis Phase**
   - Files read in parallel
   - Tokens counted per file
   - Priority scores calculated
   - Files sorted by importance

4. **Generation Phase**
   - Project structure generated
   - Files processed in priority order
   - Markdown formatted with syntax hints
   - Token budget tracked

5. **Output Phase**
   - Output directed to appropriate target
   - Progress reported if requested
   - Errors handled gracefully

## Performance Optimizations

### Parallel Processing
- Directory traversal parallelized with `rayon`
- File reading done concurrently
- Token counting distributed across threads
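
To show the shape of distributing per-file work across threads, here is a sketch that uses `std::thread::scope` as a dependency-free stand-in for `rayon`. Whitespace splitting stands in for the real tokenizer, and `count_all` is a hypothetical function name.

```rust
use std::thread;

/// Counts "tokens" for every input in parallel and returns the counts in
/// input order. Whitespace splitting is a placeholder for real tokenization.
pub fn count_all(contents: &[String], workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    // Ceiling division, clamped to 1 because `chunks(0)` panics.
    let chunk_len = ((contents.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        // One scoped thread per chunk; scoped threads may borrow `contents`.
        let handles: Vec<_> = contents
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|c| c.split_whitespace().count())
                        .collect::<Vec<usize>>()
                })
            })
            .collect();

        // Joining in spawn order preserves the original file order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

Static chunking like this can leave threads idle when file sizes vary widely. A work-stealing pool such as the one `rayon` provides balances uneven workloads better.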

### Memory Efficiency
- Streaming file processing
- Bounded channels for backpressure
- Incremental markdown building
- Cached token counts to avoid repeated tokenization
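
Backpressure with a bounded channel can be demonstrated using the standard library's `sync_channel`. In this sketch the producer stands in for the directory walker and the consumer for the file processor. `stream_lengths` is a hypothetical name, and summing path lengths is placeholder work.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Streams items through a bounded channel. The producer blocks once
/// `capacity` items are queued, so memory use stays bounded no matter
/// how many paths are discovered.
pub fn stream_lengths(paths: Vec<String>, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<String>(capacity);

    let producer = thread::spawn(move || {
        for p in paths {
            // `send` blocks while the channel is full (backpressure).
            if tx.send(p).is_err() {
                break; // receiver was dropped
            }
        }
        // `tx` is dropped here, which ends the consumer's iteration.
    });

    // Consumer: handle items one at a time as they arrive.
    let total: usize = rx.iter().map(|p| p.len()).sum();
    producer.join().expect("producer panicked");
    total
}
```

The capacity is the tuning knob: a small value caps memory tightly, while a larger one lets the producer run further ahead of a slow consumer.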

### I/O Optimization
- Buffered file reading
- Batch writes for output
- Minimal stat calls
- Efficient path handling

## Error Handling Strategy

### Error Types
```rust
/// Errors surfaced by `code-digest`, grouped by origin.
#[derive(Debug)]
pub enum CodeDigestError {
    // File system errors
    InvalidPath(String),
    ReadError(String),

    // Configuration errors
    InvalidConfiguration(String),
    ConfigParseError(String),

    // Processing errors
    TokenCountError(String),
    MarkdownGenerationError(String),

    // External tool errors
    GeminiNotFound,
    SubprocessError(String),
}
```

### Error Propagation
- Use `Result<T, CodeDigestError>` throughout
- Convert external errors at boundaries
- Provide context with error messages
- Fail fast for unrecoverable errors
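
Converting external errors at a boundary could look like the following sketch. It uses a reduced copy of the error enum with hand-written `Display` and `From` implementations, so that it compiles with the standard library alone. With `thiserror`, those implementations would typically be derived instead. The function `read_source` and the message strings are hypothetical.

```rust
use std::fmt;
use std::io;

/// Reduced copy of the error enum, limited to what this sketch needs.
#[derive(Debug)]
pub enum CodeDigestError {
    InvalidPath(String),
    ReadError(String),
}

impl fmt::Display for CodeDigestError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::InvalidPath(p) => write!(f, "invalid path: {p}"),
            Self::ReadError(msg) => write!(f, "read error: {msg}"),
        }
    }
}

impl std::error::Error for CodeDigestError {}

/// Boundary conversion: lets `?` turn an `io::Error` into the crate error.
impl From<io::Error> for CodeDigestError {
    fn from(e: io::Error) -> Self {
        Self::ReadError(e.to_string())
    }
}

/// Hypothetical reader: validates input first, then relies on `?`.
pub fn read_source(path: &str) -> Result<String, CodeDigestError> {
    if path.is_empty() {
        return Err(CodeDigestError::InvalidPath("empty path".into()));
    }
    Ok(std::fs::read_to_string(path)?)
}
```

Keeping the conversion in a `From` impl means call sites stay free of `map_err` noise, at the cost of losing the original error value. Storing the source error instead of its string form would preserve the full error chain.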

## Extension Points

### Adding New File Types
1. Update `file_ext.rs` with mappings
2. Add special handling in `digest.rs` if needed
3. Update prioritization weights

### Supporting New LLMs
1. Add tokenizer support in `token.rs`
2. Update CLI to select tokenizer
3. Adjust output format if needed

### Custom Processors
1. Define processor trait
2. Implement for specific file types
3. Register in processing pipeline
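
The three steps above might translate into something like this sketch. Since the document does not define the processor trait, every name here (`Processor`, `TomlCompactor`, `Pipeline`) and the method signatures are assumptions chosen for illustration.

```rust
/// Step 1: a hypothetical processor trait.
pub trait Processor {
    /// Whether this processor handles the given path.
    fn handles(&self, path: &str) -> bool;
    /// Transforms file content before it is written to the digest.
    fn process(&self, content: &str) -> String;
}

/// Step 2: an example implementation that drops blank lines from TOML files.
pub struct TomlCompactor;

impl Processor for TomlCompactor {
    fn handles(&self, path: &str) -> bool {
        path.ends_with(".toml")
    }

    fn process(&self, content: &str) -> String {
        content
            .lines()
            .filter(|line| !line.trim().is_empty())
            .collect::<Vec<_>>()
            .join("\n")
    }
}

/// Step 3: a registry that the processing pipeline consults per file.
#[derive(Default)]
pub struct Pipeline {
    processors: Vec<Box<dyn Processor>>,
}

impl Pipeline {
    pub fn register(&mut self, processor: Box<dyn Processor>) {
        self.processors.push(processor);
    }

    /// Applies the first matching processor, or returns the content unchanged.
    pub fn run(&self, path: &str, content: &str) -> String {
        self.processors
            .iter()
            .find(|p| p.handles(path))
            .map(|p| p.process(content))
            .unwrap_or_else(|| content.to_string())
    }
}
```

Trait objects keep registration open to processors defined outside the core modules. First-match-wins is one possible policy, and chaining all matching processors would be an equally valid alternative.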

## Configuration System

### Precedence Order
1. CLI arguments (highest)
2. Project config file (`.code-digest.toml`)
3. User config file (`~/.config/code-digest/config.toml`)
4. Default values (lowest)
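
One way to implement this precedence is to model each source as a layer of optional values and take the first one that is set. The sketch below assumes two settings only, and it borrows its fallback values (`150000` and `false`) from the sample schema in this document. `Layer`, `Settings`, and `resolve` are hypothetical names.

```rust
/// One configuration layer; `None` means "not set at this level".
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Layer {
    pub max_tokens: Option<usize>,
    pub verbose: Option<bool>,
}

/// Fully resolved settings with no optional fields left.
#[derive(Debug, PartialEq)]
pub struct Settings {
    pub max_tokens: usize,
    pub verbose: bool,
}

/// Resolves each field in precedence order: CLI, project file, user file,
/// then the built-in default (assumed values for this sketch).
pub fn resolve(cli: &Layer, project: &Layer, user: &Layer) -> Settings {
    Settings {
        max_tokens: cli
            .max_tokens
            .or(project.max_tokens)
            .or(user.max_tokens)
            .unwrap_or(150_000),
        verbose: cli
            .verbose
            .or(project.verbose)
            .or(user.verbose)
            .unwrap_or(false),
    }
}
```

Resolution is per field, not per layer, so a CLI flag that sets only `verbose` still lets `max_tokens` come from the project file.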

### Configuration Schema
```toml
[defaults]
max_tokens = 150000
verbose = false
progress = true

[ignore]
patterns = ["*.log", "*.tmp"]
use_gitignore = true

[[priorities]]
pattern = "**/*.rs"
weight = 100

[output]
include_tree = true
syntax_highlighting = true
```

## Security Considerations

### Path Traversal
- Validate all paths are within project root
- Don't follow symbolic links by default
- Sanitize file paths in output
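
A containment check can be sketched lexically, without touching the file system. The helper names `normalize` and `is_within_root` are hypothetical. Note the limitation: this version resolves `.` and `..` textually and does not see symbolic links, so it complements a link policy and does not replace one.

```rust
use std::path::{Component, Path, PathBuf};

/// Lexically normalizes `path`, resolving `.` and `..` without any I/O.
/// Returns `None` if `..` would climb above the start of the path.
fn normalize(path: &Path) -> Option<PathBuf> {
    let mut out = PathBuf::new();
    for component in path.components() {
        match component {
            Component::CurDir => {}
            Component::ParentDir => {
                if !out.pop() {
                    return None;
                }
            }
            other => out.push(other.as_os_str()),
        }
    }
    Some(out)
}

/// Returns true if `relative`, joined onto `root`, stays inside `root`.
/// Absolute candidates are rejected outright.
pub fn is_within_root(root: &Path, relative: &Path) -> bool {
    if relative.is_absolute() {
        return false;
    }
    match (normalize(root), normalize(&root.join(relative))) {
        (Some(r), Some(p)) => p.starts_with(&r),
        _ => false,
    }
}
```

`Path::starts_with` compares whole components, so `/project-other` is not treated as being inside `/project`. For paths that already exist on disk, `std::fs::canonicalize` offers a stricter check because it also resolves links.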

### Resource Limits
- Cap maximum file size
- Limit total files processed
- Timeout long operations

### External Commands
- Verify that the `gemini` executable is available before invoking it
- Sanitize inputs to subprocesses
- Handle subprocess failures gracefully
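
Checking for the external binary before spawning it could be done by scanning `PATH`, as in this sketch. `find_in_path` and `require_gemini` are hypothetical helpers. The check covers existence only, ignoring the executable bit and Windows extensions such as `.exe`, and the error is a plain string where the real code would presumably return `GeminiNotFound`.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Searches each directory listed in `path_var` for a file named `program`.
/// Existence check only: permissions and platform extensions are ignored.
pub fn find_in_path(program: &str, path_var: &OsStr) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(program))
        .find(|candidate| candidate.is_file())
}

/// Returns the resolved `gemini` binary or a message suitable for the user.
pub fn require_gemini() -> Result<PathBuf, String> {
    let path_var = env::var_os("PATH").unwrap_or_default();
    find_in_path("gemini", &path_var)
        .ok_or_else(|| "gemini not found on PATH".to_string())
}
```

Passing the search path as a parameter keeps `find_in_path` testable without mutating the process environment. Resolving the binary up front also allows a clear error before any file processing starts.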

## Testing Strategy

### Unit Tests
- Test each module in isolation
- Mock file system operations
- Verify error handling paths

### Integration Tests
- Test full processing pipeline
- Use fixture directories
- Verify output correctness

### Performance Tests
- Benchmark large repositories
- Profile memory usage
- Track performance regressions

## Future Enhancements

### Near Term
- Progress bars with `indicatif`
- Colored output with `termcolor`
- Config file watching
- Incremental processing

### Long Term
- Web UI for configuration
- Plugin system for processors
- Cloud storage integration
- Distributed processing

## Dependencies Overview

### Core Dependencies
- `clap` - CLI argument parsing
- `ignore` - Git-aware file traversal
- `tiktoken-rs` - Token counting
- `rayon` - Parallel processing

### Error Handling
- `anyhow` - Error propagation
- `thiserror` - Error definitions

### Utilities
- `walkdir` - Directory traversal (via `ignore`)
- `once_cell` - Lazy statics
- `serde` - Configuration parsing

## Conclusion

The architecture of `code-digest` is designed to be:
- **Fast**: Parallel processing and efficient algorithms
- **Flexible**: Easy to extend and configure
- **Reliable**: Comprehensive error handling
- **User-friendly**: Clear output and helpful errors

This design allows the tool to handle large codebases efficiently while producing high-quality output for LLM consumption.