codesearch 0.1.9

# CodeSearch - Technical Specification

## Overview

CodeSearch is a fast, intelligent CLI tool for searching and analyzing codebases, built in Rust. It provides precise structural understanding that complements semantic search and RAG systems for AI agents.

**Version**: 0.1.8  
**Language**: Rust (Edition 2024)  
**License**: Apache-2.0

## Core Capabilities

### 1. Pattern Search Engine

**Supported Search Modes:**
- **Exact Match**: Direct string matching with line-level precision
- **Regex**: Full regex pattern support with compiled pattern caching
- **Fuzzy**: Typo-tolerant search using Levenshtein distance
- **Semantic**: Context-aware search enhancement

**Features:**
- Parallel file processing with rayon
- Thread-safe LRU caching with automatic eviction
- Relevance scoring and ranking
- Context extraction (surrounding lines)
- Multi-extension filtering
- File/directory exclusion patterns

### 2. Language Support

**48 Languages Supported:**
- **Systems**: Rust, C, C++, Go, Zig, V, Nim
- **Web**: JavaScript, TypeScript, HTML, CSS, SCSS
- **Backend**: Python, Java, Kotlin, C#, PHP, Ruby, Scala, Perl
- **Functional**: Haskell, Elixir, Erlang, Clojure, OCaml, F#
- **Mobile**: Swift, Dart, Objective-C
- **Scripting**: Shell, PowerShell, Lua, R, Julia
- **Data**: SQL, YAML, TOML, JSON, XML
- **Infrastructure**: Dockerfile, Terraform, Makefile
- **Others**: GraphQL, Protobuf, Solidity, WebAssembly, Assembly

**Language-Specific Patterns:**
- Function definitions (e.g., `fn`, `def`, `function`, `func`, `proc`)
- Class/struct definitions
- Import/use statements
- Comment patterns (single-line, multi-line, doc comments)

### 3. Enhanced Dead Code Detection

**Detection Types (6+ categories):**

#### 3.1 Unused Variables
- Detects variables declared but never referenced
- Patterns: `let`, `const`, `var`, `:=`, `<-`
- Excludes: Variables starting with `_`, single-letter vars, `err`
- **Output**: `[var]` marker with line number and reason

#### 3.2 Unreachable Code
- Identifies code after return statements
- Tracks brace depth and control flow
- Detects statements that will never execute
- **Output**: `[!]` marker with truncated code preview
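
As a rough illustration of the brace-depth tracking described above, the sketch below flags lines that follow a `return` inside the same block. It is a line-based heuristic written for this document (the function name is hypothetical), and it assumes braces never appear inside string literals or comments.

```rust
/// Returns 1-based line numbers of statements that follow a `return`
/// in the same block. Heuristic only: braces in strings are not handled.
fn unreachable_lines(source: &str) -> Vec<usize> {
    let mut flagged = Vec::new();
    let mut depth: i32 = 0;
    // Brace depth at which the most recent `return` was seen, if still active.
    let mut returned_at: Option<i32> = None;

    for (idx, line) in source.lines().enumerate() {
        let trimmed = line.trim();
        if trimmed.is_empty() || trimmed.starts_with("//") {
            continue;
        }
        // Anything other than a closing brace after an active `return` is dead.
        if returned_at.is_some() && !trimmed.starts_with('}') {
            flagged.push(idx + 1);
        }
        // Update depth per character so that `} else {` closes the returning block.
        for c in trimmed.chars() {
            match c {
                '{' => depth += 1,
                '}' => {
                    depth -= 1;
                    if returned_at.map_or(false, |d| depth < d) {
                        returned_at = None;
                    }
                }
                _ => {}
            }
        }
        let is_return = trimmed.starts_with("return ") || trimmed == "return;";
        if returned_at.is_none() && is_return && trimmed.ends_with(';') {
            returned_at = Some(depth);
        }
    }
    flagged
}
```

Resetting the marker as soon as the depth drops below the `return`'s depth is what keeps an `else` branch from being reported.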

#### 3.3 Empty Functions
- Finds functions with no implementation
- **Brace-based languages**: Detects `{}`
- **Indentation-based languages**: Detects bodies that contain only `pass` (e.g., Python)
- Excludes special functions (main, test_, constructors, trait implementations)
- **Output**: `[∅]` marker with function name

#### 3.4 TODO/FIXME Markers
- Flags incomplete or problematic code markers
- Markers: TODO, FIXME, HACK, XXX, BUG
- Only detects in comments (not in strings)
- **Output**: `[?]` marker with truncated comment

#### 3.5 Commented-Out Code
- Detects code that has been commented out
- Identifies function/variable declarations in comments
- Excludes documentation comments and standard notes
- **Output**: `[commented code]` with truncated line

#### 3.6 Unused Imports
- Tracks import/use statements
- Counts references across entire file
- Reports imports whose name occurs at most once in the file, i.e., only in the import statement itself
- **Output**: `[imp]` marker with import name
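
The occurrence-counting rule above can be sketched as follows. This is a simplified illustration, not the tool's detector: it assumes Rust-style single-item `use` statements, skips grouped, glob, and aliased imports, and both function names are hypothetical.

```rust
/// Names of single-item `use` imports that occur at most once in `source`,
/// i.e. only in the import line itself.
fn unused_imports(source: &str) -> Vec<String> {
    let mut unused = Vec::new();
    for line in source.lines() {
        let trimmed = line.trim();
        let Some(path) = trimmed
            .strip_prefix("use ")
            .and_then(|rest| rest.strip_suffix(';'))
        else {
            continue;
        };
        // The imported name is the last path segment.
        let name = path.rsplit("::").next().unwrap_or(path).trim();
        // Skip grouped (`{A, B}`), glob (`*`), and aliased (`X as Y`) imports.
        let is_plain_ident =
            !name.is_empty() && name.chars().all(|c| c.is_alphanumeric() || c == '_');
        if !is_plain_ident {
            continue;
        }
        if count_identifier(source, name) <= 1 {
            unused.push(name.to_string());
        }
    }
    unused
}

/// Counts whole-identifier occurrences of `name` in `source`.
fn count_identifier(source: &str, name: &str) -> usize {
    source
        .split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|token| *token == name)
        .count()
}
```

Because the count is purely textual, a name that also appears in a comment or string would be treated as used; a real detector would need to account for that.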

**Special Function Exclusions:**
- Entry points: `main`, `init`, `__init__`
- Test functions: `test_*`, `Test*`
- Lifecycle: `setup`, `teardown`, `drop`, `finalize`
- Trait implementations: `clone`, `fmt`, `eq`, `hash`, `serialize`
- Event handlers: `on*`, `handle*`
- Private functions: `_*`

### 4. Code Complexity Analysis

**Metrics Calculated:**
- **Cyclomatic Complexity**: Number of linearly independent paths
- **Cognitive Complexity**: Measure of code understandability
- **Nesting Depth**: Maximum nesting level
- **Function Count**: Total functions per file
- **Line Count**: Total lines of code

**Thresholds:**
- Low: < 10
- Medium: 10-20
- High: > 20
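
To make the cyclomatic metric and the thresholds above concrete, here is a token-counting sketch: one base path plus one per decision point. The keyword set and both function names are assumptions made for illustration; the tool's own rules may count different constructs.

```rust
/// Approximates cyclomatic complexity as 1 + number of decision points.
/// The keyword list below is an assumed, language-agnostic subset.
fn cyclomatic_complexity(source: &str) -> u32 {
    const KEYWORDS: [&str; 4] = ["if", "while", "for", "case"];
    let keyword_hits = source
        .split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|token| KEYWORDS.contains(token))
        .count();
    // Short-circuit operators add a branch each.
    let operator_hits = source.matches("&&").count() + source.matches("||").count();
    1 + (keyword_hits + operator_hits) as u32
}

/// Maps a score onto the Low / Medium / High bands listed above.
fn complexity_band(score: u32) -> &'static str {
    match score {
        0..=9 => "low",
        10..=20 => "medium",
        _ => "high",
    }
}
```

For the snippet `fn f(a: bool, b: bool) { if a && b { } for _ in 0..3 { } }` the sketch counts `if`, `for`, and `&&`, giving a score of 4, which falls in the low band.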

### 5. Comprehensive Code Metrics

**Halstead Metrics (11 sub-metrics):**
- n1: Number of distinct operators
- n2: Number of distinct operands
- N1: Total operator count
- N2: Total operand count
- Program length: N = N1 + N2
- Vocabulary: n = n1 + n2
- Volume: V = N × log2(n)
- Difficulty: D = (n1/2) × (N2/n2)
- Effort: E = D × V
- Time: T = E / 18 (seconds)
- Bugs: B = E^(2/3) / 3000
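
The derived Halstead values follow directly from the four base counts. The sketch below applies the formulas listed above; the struct and function names are hypothetical, and returning `None` when there are no distinct operands is an assumed convention to avoid dividing by zero.

```rust
/// Derived Halstead metrics, computed from the four base counts.
#[derive(Debug, Clone, Copy)]
struct Halstead {
    length: f64,
    vocabulary: f64,
    volume: f64,
    difficulty: f64,
    effort: f64,
    time_seconds: f64,
    bugs: f64,
}

/// n1 = distinct operators, n2 = distinct operands,
/// N1 = total operators, N2 = total operands.
fn halstead(
    distinct_operators: u32,
    distinct_operands: u32,
    total_operators: u32,
    total_operands: u32,
) -> Option<Halstead> {
    if distinct_operands == 0 {
        return None; // Difficulty would divide by zero.
    }
    let (n1, n2) = (f64::from(distinct_operators), f64::from(distinct_operands));
    let (big_n1, big_n2) = (f64::from(total_operators), f64::from(total_operands));

    let length = big_n1 + big_n2; // N = N1 + N2
    let vocabulary = n1 + n2; // n = n1 + n2
    let volume = length * vocabulary.log2(); // V = N × log2(n)
    let difficulty = (n1 / 2.0) * (big_n2 / n2); // D = (n1/2) × (N2/n2)
    let effort = difficulty * volume; // E = D × V
    Some(Halstead {
        length,
        vocabulary,
        volume,
        difficulty,
        effort,
        time_seconds: effort / 18.0, // T = E / 18
        bugs: effort.powf(2.0 / 3.0) / 3000.0, // B = E^(2/3) / 3000
    })
}
```

As a worked case, n1 = 4, n2 = 4, N1 = 8, N2 = 8 gives N = 16, n = 8, V = 16 × 3 = 48, D = 2 × 2 = 4, and E = 192.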

**Additional Metrics:**
- Essential Complexity
- NPath Complexity
- Lines of Code (LOC, SLOC, LLOC)
- Code Density & Comment Ratio
- Maintainability Index (MI)
- Code Churn
- Depth of Inheritance Tree (DIT)
- Coupling Between Objects (CBO)
- Lack of Cohesion in Methods (LCOM)

### 6. Design Metrics

**Metrics Calculated:**
- **Afferent Coupling (Ca)**: Number of incoming dependencies
- **Efferent Coupling (Ce)**: Number of outgoing dependencies
- **Instability (I)**: Ce / (Ca + Ce)
- **Abstractness (A)**: Number of abstract classes / total classes
- **Distance from Main Sequence (D)**: |A + I - 1|
- **Package Cohesion (LCOM)**: Lack of cohesion in methods
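
The three ratio-based metrics above can be computed as in this sketch. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than documented behavior.

```rust
/// Instability I = Ce / (Ca + Ce). Returns 0.0 for a module with no couplings (assumed).
fn instability(afferent: usize, efferent: usize) -> f64 {
    let total = afferent + efferent;
    if total == 0 {
        0.0
    } else {
        efferent as f64 / total as f64
    }
}

/// Abstractness A = abstract types / total types. Returns 0.0 when there are no types (assumed).
fn abstractness(abstract_types: usize, total_types: usize) -> f64 {
    if total_types == 0 {
        0.0
    } else {
        abstract_types as f64 / total_types as f64
    }
}

/// Distance from the main sequence D = |A + I - 1|.
fn distance_from_main_sequence(abstractness: f64, instability: f64) -> f64 {
    (abstractness + instability - 1.0).abs()
}
```

A module with Ca = 3 and Ce = 1 has I = 0.25; if one of its four types is abstract, A = 0.25 and D = |0.25 + 0.25 - 1| = 0.5.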

### 7. Code Duplication Detection

**Algorithm:**
- Extracts code blocks (minimum configurable lines)
- Calculates string similarity using normalized edit distance
- Configurable similarity threshold (default: 0.9)
- Reports file pairs with similar blocks
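
One way to read "normalized edit distance" is `1 - distance / max(len_a, len_b)`, which yields a similarity in the range 0.0 to 1.0. The sketch below uses that reading; it is an assumption about the normalization, and `similarity` and `is_duplicate` are hypothetical names.

```rust
/// Plain Levenshtein distance over character slices.
fn edit_distance(a: &[char], b: &[char]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            curr.push(substitution.min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Similarity in [0.0, 1.0]: 1.0 means identical, 0.0 means nothing in common.
fn similarity(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let longest = a.len().max(b.len());
    if longest == 0 {
        return 1.0; // Two empty blocks are trivially identical.
    }
    1.0 - edit_distance(&a, &b) as f64 / longest as f64
}

/// True when two blocks meet the configured similarity threshold (default in this spec: 0.9).
fn is_duplicate(block_a: &str, block_b: &str, threshold: f64) -> bool {
    similarity(block_a, block_b) >= threshold
}
```

Two ten-character blocks that differ in a single character score about 0.9 under this definition. A production detector would likely normalize whitespace before comparing.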

### 8. Graph Analysis

**Supported Graph Types:**
- **AST**: Abstract Syntax Tree with syntax edges
- **CFG**: Control Flow Graph with basic blocks and branches
- **DFG**: Data Flow Graph tracking variable usage
- **PDG**: Program Dependence Graph (control dependences from the CFG plus data dependences from the DFG)
- **Call Graph**: Function call relationships
- **Dependency Graph**: Module and file dependencies
- **Unified Graph**: Combined AST + CFG + DFG in one structure

**Output Formats:**
- Text (human-readable)
- DOT (Graphviz visualization)
- JSON (structured data)
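
For the DOT format, a graph is written as a `digraph` block with one `"from" -> "to";` line per edge. The sketch below shows that shape for a list of call edges; `to_dot` is a hypothetical helper, and it assumes node names contain no double quotes.

```rust
/// Renders directed edges as a Graphviz DOT digraph.
/// Assumes `name` is a valid DOT identifier and node names contain no `"`.
fn to_dot(name: &str, edges: &[(&str, &str)]) -> String {
    let mut out = format!("digraph {name} {{\n");
    for (from, to) in edges {
        out.push_str(&format!("    \"{from}\" -> \"{to}\";\n"));
    }
    out.push_str("}\n");
    out
}
```

Calling `to_dot("calls", &[("main", "run")])` produces a three-line graph that Graphviz tools such as `dot -Tsvg` can render.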

### 9. Symbol Finding

**Find Types:**
- **definition**: Find symbol definitions
- **references**: Find all references to a symbol
- **callers**: Find all callers of a function

**Features:**
- Multi-language support
- Cross-file analysis
- JSON output for automation

### 10. Health Scoring

**Components:**
- Dead code detection
- Code duplication
- Complexity analysis

**Output:**
- Overall health score (0-100)
- Individual component scores
- Structured JSON for CI/CD integration
- `--fail-under` threshold support
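
This spec does not state how the component scores are weighted. The sketch below assumes a plain unweighted mean, which happens to reproduce the sample report later in this document (90, 95, and 70 average to 85); the type and method names are hypothetical.

```rust
/// Component scores, each in the range 0-100.
struct HealthReport {
    dead_code: u8,
    duplicates: u8,
    complexity: u8,
}

impl HealthReport {
    /// Overall score as an unweighted mean (assumed weighting).
    fn overall(&self) -> u8 {
        let sum = u16::from(self.dead_code)
            + u16::from(self.duplicates)
            + u16::from(self.complexity);
        (sum / 3) as u8
    }

    /// Mirrors a `--fail-under N` check: false means the run should fail.
    fn passes(&self, fail_under: u8) -> bool {
        self.overall() >= fail_under
    }
}
```

In a CI job, a `false` result from `passes` would typically translate into a non-zero exit code, for example via `std::process::exit(1)`.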

### 11. Security Pattern Scanning

**Patterns Detected:**
- `eval()` usage
- `exec()` calls
- SQL injection patterns
- Command injection patterns
- Insecure deserialization
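
A pattern scanner of this kind can be as simple as a per-line substring check. The sketch below covers only the first two patterns in the list and is not the tool's rule set; the function name and labels are hypothetical, and a plain substring test will also match identifiers that merely end in `eval` or `exec`.

```rust
/// Returns (1-based line number, label) for each risky call found.
/// Substring matching only: expect false positives such as `medieval(`.
fn scan_security(source: &str) -> Vec<(usize, &'static str)> {
    const PATTERNS: [(&str, &str); 2] = [
        ("eval(", "eval() usage"),
        ("exec(", "exec() call"),
    ];
    let mut findings = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        for &(needle, label) in PATTERNS.iter() {
            if line.contains(needle) {
                findings.push((idx + 1, label));
            }
        }
    }
    findings
}
```

Detecting SQL or command injection reliably needs data-flow information (for example from the DFG), which is beyond what a line scanner like this can provide.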

### 12. MCP Server Integration

**Protocol**: Model Context Protocol (MCP)  
**Transport**: stdio  
**Version**: rmcp 0.12

**Exposed Tools (9 total):**
1. `search` - Pattern search with filters
2. `list` - Directory enumeration
3. `analyze` - Codebase metrics and statistics
4. `complexity` - Complexity analysis
5. `duplicates` - Duplication detection
6. `deadcode` - Dead code analysis
7. `circular` - Circular dependency detection
8. `find_symbol` - Symbol finding (definition, references, callers)
9. `get_health` - Health scoring

**Tool Schemas:**
- JSON schema generation with `schemars`
- Automatic parameter validation
- Structured response formatting

## Architecture

### Module Organization

```
codesearch/
├── src/
│   ├── lib.rs                    # Library exports
│   ├── main.rs                   # CLI entry point
│   ├── cli.rs                    # CLI definitions
│   │
│   ├── commands/                 # Command handlers
│   │   ├── mod.rs
│   │   ├── search.rs
│   │   ├── analysis.rs
│   │   ├── graph.rs
│   │   └── util.rs
│   │
│   ├── search/                   # Search functionality
│   │   ├── mod.rs
│   │   ├── core.rs
│   │   ├── fuzzy.rs
│   │   ├── semantic.rs
│   │   ├── utilities.rs
│   │   ├── engine.rs
│   │   └── pure.rs
│   │
│   ├── deadcode/                 # Dead code detection
│   │   ├── mod.rs
│   │   ├── detectors.rs
│   │   ├── helpers.rs
│   │   └── types.rs
│   │
│   ├── duplicates/               # Duplication detection
│   ├── circular/                 # Circular dependencies
│   ├── codemetrics/              # Code metrics
│   ├── designmetrics/            # Design metrics
│   │
│   ├── graphs.rs                 # Graph analysis interface
│   ├── ast.rs                    # AST
│   ├── cfg.rs                    # CFG
│   ├── dfg.rs                    # DFG
│   ├── pdg.rs                    # PDG
│   ├── callgraph.rs              # Call graph
│   ├── depgraph.rs               # Dependency graph
│   ├── unified.rs                # Unified graph
│   │
│   ├── find.rs                   # Symbol finding
│   ├── health.rs                 # Health scoring
│   ├── security.rs               # Security scanning
│   │
│   ├── language/                 # Language support
│   ├── parser/                   # Code parsers
│   ├── extract/                  # Code extraction
│   │
│   ├── cache.rs                  # Simple cache
│   ├── cache_lru.rs              # LRU cache
│   ├── index.rs                  # Code indexing
│   ├── watcher.rs                # File watching
│   │
│   ├── githistory.rs             # Git history
│   ├── remote.rs                 # Remote search
│   │
│   ├── export.rs                 # Export functionality
│   ├── interactive.rs            # REPL mode
│   │
│   ├── mcp/                      # MCP integration
│   │   ├── mod.rs
│   │   ├── tools.rs
│   │   ├── schemas.rs
│   │   └── params.rs
│   │
│   ├── types.rs                  # Shared types
│   ├── traits.rs                 # Core traits
│   ├── errors.rs                 # Custom errors
│   └── fs.rs                     # File system abstraction
│
├── tests/
│   ├── integration_e2e.rs        # Integration tests
│   ├── cross_file_tests.rs       # Cross-file tests
│   └── fixtures/                 # Test fixtures
│
└── benches/
    ├── search_benchmark.rs       # Search benchmarks
    └── parser_benchmarks.rs      # Parser benchmarks
```

**Total**: 6,000+ lines of Rust code across 40+ modules

### Data Structures

```rust
// Search options (parameter object pattern)
pub struct SearchOptions {
    pub extensions: Option<Vec<String>>,
    pub ignore_case: bool,
    pub fuzzy: bool,
    pub fuzzy_threshold: f64,
    pub max_results: usize,
    pub exclude: Option<Vec<String>>,
    pub rank: bool,
    pub cache: bool,
    pub semantic: bool,
    pub benchmark: bool,
    pub vs_grep: bool,
}

// Dead code detection result
pub struct DeadCodeItem {
    pub file: String,
    pub line_number: usize,
    pub item_type: String,
    pub name: String,
    pub reason: String,
}

// Search result
pub struct SearchResult {
    pub file: String,
    pub line_number: usize,
    pub content: String,
    pub matches: Vec<Match>,
    pub score: f64,
    pub relevance: String,
}

// Complexity metrics
pub struct ComplexityMetrics {
    pub file_path: String,
    pub cyclomatic_complexity: u32,
    pub cognitive_complexity: u32,
    pub lines_of_code: usize,
    pub function_count: usize,
    pub max_nesting_depth: u32,
}

// Design metrics
pub struct ModuleMetrics {
    pub name: String,
    pub afferent_coupling: usize,  // Ca
    pub efferent_coupling: usize,  // Ce
    pub instability: f64,          // I
    pub abstractness: f64,         // A
    pub distance: f64,             // D
    pub cohesion: f64,             // LCOM
}
```

## Performance Characteristics

### Optimization Strategies

1. **Parallel Processing**
   - Uses rayon for multi-threaded file processing
   - Scales to available CPU cores
   - Thread-safe operations throughout

2. **Caching**
   - LRU cache with automatic eviction
   - Query-based caching with file modification tracking
   - Thread-safe DashMap implementation
   - Configurable cache capacity

3. **Memory Efficiency**
   - Streaming file reading (no full file loads)
   - Efficient data structures (DashMap, ahash)
   - Lazy evaluation where possible

4. **Regex Optimization**
   - Compiled patterns cached
   - Reused across file processing
   - Regex compilation outside loops
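
The caching strategy in item 2 can be illustrated with a small LRU cache keyed by query string. This is a single-threaded, standard-library sketch with hypothetical names; the tool itself is described as using a thread-safe `DashMap`-based cache, so treat this only as a picture of the eviction policy.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal LRU cache: query string -> cached result lines.
struct LruCache {
    capacity: usize,
    map: HashMap<String, Vec<String>>,
    // Keys ordered from least to most recently used.
    order: VecDeque<String>,
}

impl LruCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Looks up a key and marks it as most recently used.
    fn get(&mut self, key: &str) -> Option<&Vec<String>> {
        if self.map.contains_key(key) {
            self.touch(key);
        }
        self.map.get(key)
    }

    /// Inserts or updates an entry, evicting the least recently used one when full.
    fn put(&mut self, key: String, value: Vec<String>) {
        if self.capacity == 0 {
            return;
        }
        if self.map.insert(key.clone(), value).is_some() {
            self.touch(&key); // Existing key: just refresh its position.
            return;
        }
        self.order.push_back(key);
        if self.map.len() > self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
    }

    /// Moves `key` to the most-recently-used end of the order queue.
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k.as_str() == key) {
            if let Some(k) = self.order.remove(pos) {
                self.order.push_back(k);
            }
        }
    }
}
```

`touch` is linear in the number of entries, which is acceptable for a sketch; a production cache would use a linked structure for constant-time updates and would also invalidate entries when a file's modification time changes.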

### Performance Targets

- **Search Latency**: < 50ms for typical queries (< 1000 files)
- **Memory Usage**: < 100MB for moderate codebases (< 10K files)
- **Parallel Efficiency**: 70%+ CPU utilization on multi-core systems
- **Cache Hit Rate**: 70-90% for repeated searches

*Note: Actual performance depends on codebase size, hardware, and query complexity.*

## Testing Strategy

### Test Coverage (173 unit + 36 integration + 23 MCP = 232 tests)

**Unit Tests (173 tests):**
- Co-located with implementation code
- Test individual functions in isolation
- Use temporary directories for file operations
- Pure function testing

**Integration Tests (36 tests):**
- End-to-end CLI command testing
- Output format validation
- Error handling verification
- Cross-file analysis testing

**MCP Tests (23 tests):**
- Tool invocation testing
- Parameter validation
- Response format verification

**Property-Based Tests:**
- Use `proptest` to generate randomized inputs
- Check that invariants hold for every generated input

### Test Execution

```bash
# All tests
cargo test --features mcp

# Specific module
cargo test deadcode --lib

# With output
cargo test -- --nocapture

# Run benchmarks
cargo bench

# Generate coverage
cargo tarpaulin --out Html
```

## CLI Interface

### Commands

```bash
# Search
codesearch "<query>" [path] [options]
codesearch interactive

# Analysis
codesearch analyze [path] [--metrics]
codesearch complexity [path] [--threshold N] [--sort]
codesearch deadcode [path] [-e extensions] [--exclude dirs]
codesearch duplicates [path] [--min-lines N] [--similarity N]
codesearch circular [path] [-e extensions]

# Graph
codesearch graph <ast|cfg|dfg|pdg|callgraph|depgraph|unified> [file]
codesearch find <symbol> [path] --type <definition|references|callers>

# Health
codesearch health [path] [--fail-under N]

# Security
codesearch security [path] [--extensions]

# Utilities
codesearch files [path] [--extensions]
codesearch languages
codesearch index [path]
codesearch watch [path]
codesearch git-history <query> [path]
codesearch remote --github <query> owner/repo

# MCP Server
codesearch mcp
```

### Options

- `-e, --extensions`: Filter by file extensions (comma-separated)
- `-x, --exclude`: Exclude directories/patterns (comma-separated)
- `-f, --fuzzy`: Enable fuzzy matching
- `-r, --regex`: Enable regex mode
- `-i, --ignore-case`: Case-insensitive search
- `--case-sensitive`: Case-sensitive search
- `--rank`: Rank results by relevance
- `--format`: Output format (text, csv, markdown, json, dot)
- `-o, --output`: Output file path
- `--threshold`: Complexity/similarity threshold
- `--sort`: Sort results
- `--cache`: Enable caching
- `--semantic`: Enable semantic search
- `--benchmark`: Benchmark mode
- `--fail-under`: Fail if score below threshold

## Output Formats

### Dead Code Detection Output

```
🔍 Dead Code Detection
─────────────────────────────

Found 12 potential dead code items:

[src/example.rs]
   [var] L  10: variable 'unused_var' - Variable declared but never used
   [!]   L  25: unreachable - Code after return statement is unreachable
   [∅]   L  42: empty_helper - Empty function with no implementation
   [?]   L  58: // TODO: implement this - TODO marker
   [imp] L  72: import 'HashMap' - Imported but never used

📊 Summary:
   • variable: 3
   • unreachable: 2
   • empty: 2
   • todo: 3
   • import: 2
```

### Health Score Output

```
🏥 Code Health Report
─────────────────────────────

Overall Health Score: 85/100 ✅

Components:
  • Dead Code: 90/100 (3 issues)
  • Duplicates: 95/100 (2 duplicates)
  • Complexity: 70/100 (5 high-complexity functions)

Recommendations:
  1. Review high-complexity functions in src/auth.rs
  2. Remove duplicate code in src/utils.rs
  3. Clean up unused variables in src/main.rs
```

## Dependencies

### Production Dependencies
- clap 4.4 - CLI parsing
- regex 1.10 - Pattern matching
- walkdir 2.4 - Directory traversal
- serde 1.0 - Serialization
- serde_json 1.0 - JSON serialization
- colored 2.1 - Terminal colors
- rayon 1.8 - Parallel processing
- dashmap 5.5 - Thread-safe maps
- ahash 0.8 - Fast hashing
- fuzzy-matcher 0.3 - Fuzzy search
- thiserror 1.0 - Custom error types
- anyhow 1.0 - Error propagation

### Optional Dependencies (MCP)
- rmcp 0.12 - MCP protocol
- tokio 1.0 - Async runtime
- schemars 1.2 - JSON schema

### Development Dependencies
- tempfile 3.8 - Temporary files for tests
- proptest 1.4 - Property-based testing
- criterion 0.5 - Benchmarking

## Build Configuration

```toml
[package]
name = "codesearch"
version = "0.1.8"
edition = "2024"
license = "Apache-2.0"

[features]
default = []
mcp = ["rmcp", "tokio", "schemars"]

[dependencies]
# ... dependencies listed above
```

**Rust Edition**: 2024  
**MSRV**: Rust 1.85+ (the first stable release that supports Edition 2024)  
**Target**: Native binary (CLI-only, no WASM)

## Quality Standards

### Code Quality
- ✅ **100% test pass rate** (232 tests)
- ✅ **Zero clippy warnings**
- ✅ **Modular architecture** (40+ focused modules)
- ✅ **Thread-safe** parallel processing
- ✅ **Comprehensive error handling**

### Maintainability
- Trait abstractions for extensibility
- Parameter object pattern
- Dependency injection for testability
- Clear separation of concerns

### Performance
- Fast: < 50ms for typical searches
- Parallel: Auto-scales to available CPU cores
- Smart caching: LRU with automatic eviction
- Memory efficient: Streaming file reading

## Future Enhancements

### Planned Features
- AST-based code analysis for more languages
- Incremental indexing for very large codebases
- Enhanced git history search
- Plugin system for custom analyzers
- Web UI for visualization
- ML-based code pattern recognition

### Performance Improvements
- File watching for real-time updates
- Optimized memory usage for large files
- AST caching for frequently accessed files
- Query warming for common searches

## Version History

### 0.1.8 (Current)
- Comprehensive code metrics (Halstead, maintainability, etc.)
- Design metrics (coupling, cohesion, instability)
- Enhanced dead code detection (6+ types)
- MCP server with 9 tools
- Health scoring with CI/CD integration
- Symbol finding (definition, references, callers)
- Security pattern scanning
- Graph analysis (7 graph types)
- Trait abstractions for testability
- LRU cache with automatic eviction

### 0.1.7
- Modular architecture (40+ modules)
- Command handlers extracted from main.rs
- Search engine refactored with traits
- Dependency injection for file system
- Custom error types
- Parameter object pattern

### 0.1.6
- Graph analysis (AST, CFG, DFG, PDG, Call, Dependency, Unified)
- DOT format export
- Find symbol command
- Health scoring

### 0.1.5
- Design metrics module
- Comprehensive code metrics
- Dead code detection enhancements
- Property-based tests
- Benchmark suite

### 0.1.4
- Enhanced dead code detection
- 11 new unit tests
- Updated documentation

### 0.1.3
- MCP server support
- Support for 48 languages
- Complexity metrics
- Code duplication detection

### 0.1.2
- Interactive mode
- Fuzzy search
- Export functionality

### 0.1.1
- Basic search functionality
- Regex support
- Multi-extension filtering

## License

Apache-2.0 License

---

**Built with ❤️ in Rust** | **Precise** | **Fast** | **Agent-Ready**