codestat 0.1.0

Fast and safe code statistics tool
# CodeStat - AI Agent Guide

## Project Overview

CodeStat (codestat) is a high-performance code statistics tool written in Rust. It counts lines of code and programming language distribution in Git repositories or specified directories.

**Key Characteristics:**
- **Language**: Rust (2024 Edition)
- **Version**: 0.1.0
- **License**: MIT
- **Repository Structure**: Single Rust crate under `codestat/` directory

**Main Features:**
- ⚡ High performance with parallel processing (Rayon) and SIMD acceleration
- 🛡️ Memory-safe pure Rust implementation
- 📊 Detailed statistics: code lines, comment lines, blank lines
- 🔍 Smart detection for 50+ programming languages
- 🎯 Respects `.gitignore` patterns
- 📁 Multiple output formats: table (default), JSON, CSV
- 💾 Incremental caching for faster subsequent runs

## Project Structure

```
codestat/
├── Cargo.toml          # Project configuration
├── Cargo.lock          # Dependency lock file
├── README.md           # Documentation (in Chinese)
└── src/
    ├── main.rs         # CLI entry point, file collection, result output
    ├── counter.rs      # File counting logic (mmap, buffer, SIMD)
    ├── language.rs     # Language detection, comment syntax definitions
    ├── stats.rs        # Statistics data structures (FileStats, TotalStats)
    ├── cache.rs        # Incremental caching system
    ├── mempool.rs      # Memory pool for buffer reuse
    ├── simd_scanner.rs # SIMD-accelerated byte scanning (NEON/SSE/AVX)
    ├── async_counter.rs# Async I/O implementation (experimental)
    └── benchmark.rs    # Built-in benchmark suite
```

## Technology Stack

**Core Dependencies:**
- `clap` (4.5) - Command-line argument parsing with derive macros
- `rayon` (1.10) - Data parallelism and work-stealing scheduler
- `ignore` (0.4) - File traversal with .gitignore support
- `serde`/`serde_json` (1.0) - JSON serialization
- `memmap2` (0.9) - Memory-mapped file I/O (zero-copy)
- `num_cpus` (1.16) - CPU core detection
- `tempfile` (3.14) - Temporary file creation for benchmarks
- `lazy_static` (1.5) - Static variable initialization

**Build Profiles:**
- Release profile optimized with LTO, single codegen unit, panic=abort, stripped binaries

## Build and Run Commands

```bash
# Build debug version
cargo build

# Build optimized release version
cargo build --release

# Run tests
cargo test

# Run the tool
cargo run --release -- [OPTIONS] [PATH]

# Run with specific options
cargo run --release -- --format json /path/to/project
cargo run --release -- --benchmark
```

**Binary Location:** `codestat/target/release/codestat`

## CLI Usage

```bash
# Basic usage
codestat                          # Analyze current directory
codestat /path/to/project         # Analyze specific directory
codestat src/main.rs              # Analyze single file

# Output formats
codestat --format table .         # Table format (default)
codestat --format json .          # JSON output
codestat --format csv .           # CSV output

# Options
codestat --files .                # Show per-file statistics
codestat --progress .             # Show progress
codestat --hidden .               # Include hidden files
codestat --follow-links .         # Follow symbolic links
codestat --no-parallel .          # Disable parallel processing
codestat --fresh .                # Disable cache, force full scan
codestat --rebuild-cache .        # Clear and rebuild cache

# Filtering
codestat --languages rust,python,go .   # Only specific languages
codestat --exclude "*.test.js,*.min.css" .  # Exclude patterns

# Benchmark
codestat --benchmark              # Run internal benchmark suite
```

## Code Organization

### Module Responsibilities

**`main.rs`**
- CLI argument parsing with clap derive macros
- File collection using `ignore::WalkBuilder`
- Cache management (incremental context)
- Parallel/sequential processing orchestration
- Output formatting (table, JSON, CSV)

**`counter.rs`**
- `count_file()` - Main entry for file counting
- `count_file_mmap()` - Memory-mapped I/O for large files (>1MB)
- `count_file_buffered()` - Buffered I/O for small files with memory pool
- `analyze_bytes()` - Byte-level line analysis without UTF-8 validation
- `analyze_line_bytes()` - Comment/code detection per line
- SIMD-accelerated blank line detection

**`language.rs`**
- `Language` enum - 50+ programming languages
- `CommentSyntax` struct - Line and block comment patterns
- `detect_language()` - Extension and filename-based detection
- Static HashMaps with `OnceLock` for O(1) lookup

**`stats.rs`**
- `FileStats` - Per-file statistics (lines, code, comments, blanks, bytes)
- `LanguageStats` - Aggregated per-language statistics
- `TotalStats` - Overall statistics with sorting methods
- `Summary` - Final result with timing and errors

**`cache.rs`**
- `Cache` - Persistent JSON cache storage
- `CacheEntry` - Individual file cache (mtime, size, stats)
- `IncrementalContext` - Cache hit/miss tracking
- Default cache file: `.codestat-cache.json`

**`mempool.rs`**
- `acquire_buffer()` - Get reusable buffer from pool
- `release_buffer()` - Return buffer to pool
- Thread-local and global pools with size limits
- `open_with_advise()` - Linux preadvise optimization

**`simd_scanner.rs`**
- `count_newlines()` - SIMD-accelerated newline counting
- ARM NEON implementation (128-bit)
- x86 SSE2 (128-bit) and AVX2 (256-bit) implementations
- Fallback for non-SIMD platforms

## Testing Strategy

**Unit Tests:**
- Embedded in each module using `#[cfg(test)]`
- Tests for comment parsing in various languages
- Buffer pool operations
- Cache save/load
- SIMD scanner functions

**Benchmark Tests:**
- `codestat --benchmark` runs comprehensive benchmarks
- Language detection performance
- File size scaling (1KB to 10MB)
- Language parsing comparison
- Memory usage tests
- Parallel vs sequential speedup

## Development Conventions

**Code Style:**
- Chinese comments for documentation and explanations
- English for code identifiers (variables, functions, types)
- `#[inline(always)]` for hot path functions
- Extensive use of `unsafe` only in SIMD modules with proper target_feature gates

**Performance Optimizations:**
1. **Zero-copy**: Memory-mapped files for large inputs
2. **Byte-level parsing**: Avoid String/UTF-8 validation
3. **SIMD**: Platform-specific vectorized operations
4. **Memory pooling**: Reuse buffers across file operations
5. **Parallel processing**: Rayon for data parallelism with work-stealing
6. **Incremental caching**: Skip unchanged files on subsequent runs
7. **Smart scheduling**: Sort files by size for better load balancing

**File Size Thresholds:**
- Small files (< 1MB): Buffered I/O with memory pool
- Large files (> 1MB): Memory-mapped I/O
- Parallel threshold: < 100 files use sequential processing

## Supported Languages

**50+ Programming Languages:**
- Systems: Rust, C, C++, Go, Zig, Assembly
- Web: JavaScript, TypeScript, HTML, CSS, SCSS, PHP
- JVM: Java, Kotlin, Scala, Groovy, Clojure
- Mobile: Swift, Objective-C, Dart
- Scripting: Python, Ruby, Perl, Shell, PowerShell
- Functional: Haskell, Elixir, Erlang, F#, Lisp
- Data: R, MATLAB, Julia
- Config: JSON, YAML, XML, TOML, Markdown

## Cache System

**Cache File:** `.codestat-cache.json` (stored in target directory)

**Cache Validity:** Checked via file mtime and size comparison

**Cache Commands:**
- `--fresh` - Ignore existing cache
- `--rebuild-cache` - Clear and rebuild from scratch

**Security:** Cache file created with 0o600 permissions on Unix

## Security Considerations

- Memory-mapped files use `unsafe` but with bounds checking via the mmap crate
- SIMD operations use proper target feature detection
- No execution of arbitrary code
- Respects .gitignore for security (won't traverse sensitive paths)
- File permissions preserved for cache

## Performance Benchmarks (Reference)

Tested on M1 Pro, macOS:
- 4,600 files, 1.6M lines: ~0.42s
- 50K+ files, 20M+ lines: ~3s
- Parallel speedup: ~6.4x on 8-core system

Comparison with other tools:
- codestat: 0.42s (this tool)
- tokei: ~0.6s
- scc: ~1s
- cloc: ~4s