# CodeSearch - Technical Specification
## Overview
CodeSearch is a fast, intelligent CLI tool for searching and analyzing codebases, built in Rust. It provides precise structural understanding that complements semantic search and RAG systems for AI agents.
**Version**: 0.1.8
**Language**: Rust (Edition 2024)
**License**: Apache-2.0
## Core Capabilities
### 1. Pattern Search Engine
**Supported Search Modes:**
- **Exact Match**: Direct string matching with line-level precision
- **Regex**: Full regex pattern support with compiled pattern caching
- **Fuzzy**: Typo-tolerant search using Levenshtein distance
- **Semantic**: Context-aware search enhancement
**Features:**
- Parallel file processing with rayon
- Thread-safe LRU caching with automatic eviction
- Relevance scoring and ranking
- Context extraction (surrounding lines)
- Multi-extension filtering
- File/directory exclusion patterns
### 2. Language Support
**48 Languages Supported:**
- **Systems**: Rust, C, C++, Go, Zig, V, Nim
- **Web**: JavaScript, TypeScript, HTML, CSS, SCSS
- **Backend**: Python, Java, Kotlin, C#, PHP, Ruby, Scala, Perl
- **Functional**: Haskell, Elixir, Erlang, Clojure, OCaml, F#
- **Mobile**: Swift, Dart, Objective-C
- **Scripting**: Shell, PowerShell, Lua, R, Julia
- **Data**: SQL, YAML, TOML, JSON, XML
- **Infrastructure**: Dockerfile, Terraform, Makefile
- **Others**: GraphQL, Protobuf, Solidity, WebAssembly, Assembly
**Language-Specific Patterns:**
- Function definitions (e.g., `fn`, `def`, `function`, `func`, `proc`)
- Class/struct definitions
- Import/use statements
- Comment patterns (single-line, multi-line, doc comments)
### 3. Enhanced Dead Code Detection
**Detection Types (6+ categories):**
#### 3.1 Unused Variables
- Detects variables declared but never referenced
- Patterns: `let`, `const`, `var`, `:=`, `<-`
- Excludes: Variables starting with `_`, single-letter vars, `err`
- **Output**: `[var]` marker with line number and reason
#### 3.2 Unreachable Code
- Identifies code after return statements
- Tracks brace depth and control flow
- Detects statements that will never execute
- **Output**: `[!]` marker with truncated code preview
#### 3.3 Empty Functions
- Finds functions with no implementation
- **Brace-based languages**: Detects `{}`
- **Indentation-based languages**: Detects a Python `def ...:` whose body contains only `pass`
- Excludes special functions (main, test_, constructors, trait implementations)
- **Output**: `[∅]` marker with function name
#### 3.4 TODO/FIXME Markers
- Flags incomplete or problematic code markers
- Markers: TODO, FIXME, HACK, XXX, BUG
- Only detects in comments (not in strings)
- **Output**: `[?]` marker with truncated comment
#### 3.5 Commented-Out Code
- Detects code that has been commented out
- Identifies function/variable declarations in comments
- Excludes documentation comments and standard notes
- **Output**: `[commented code]` with truncated line
#### 3.6 Unused Imports
- Tracks import/use statements
- Counts references across entire file
- Reports imports whose name occurs at most once in the file (that is, only in the import statement itself)
- **Output**: `[imp]` marker with import name
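
The occurrence-counting rule above can be sketched as follows. This is a minimal illustration rather than the tool's actual detector: it assumes simple Rust-style `use path::Name;` lines, and the names `unused_imports` and `count_word` are hypothetical.

```rust
/// Returns imported names that occur at most once in `source`,
/// i.e. only inside their own `use` statement.
fn unused_imports(source: &str) -> Vec<String> {
    let mut unused = Vec::new();
    for line in source.lines() {
        let line = line.trim();
        // Only the simplest form is handled here: `use path::to::Name;`
        let Some(path) = line.strip_prefix("use ").and_then(|s| s.strip_suffix(';')) else {
            continue;
        };
        let name = path.rsplit("::").next().unwrap_or(path).trim();
        if name.is_empty() || name == "*" || name.contains('{') {
            continue; // glob and grouped imports are out of scope for this sketch
        }
        if count_word(source, name) <= 1 {
            unused.push(name.to_string());
        }
    }
    unused
}

/// Counts whole-word occurrences of `word` in `text`.
fn count_word(text: &str, word: &str) -> usize {
    text.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|token| *token == word)
        .count()
}

fn main() {
    let source = "use std::collections::HashMap;\nuse std::fmt::Display;\nfn show<T: Display>(_t: T) {}\n";
    for name in unused_imports(source) {
        println!("[imp] import '{name}' - Imported but never used");
    }
}
```

A purely textual count like this also counts names that appear in comments or strings, so it can miss unused imports; it is a heuristic, not a resolver.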
**Special Function Exclusions:**
- Entry points: `main`, `init`, `__init__`
- Test functions: `test_*`, `Test*`
- Lifecycle: `setup`, `teardown`, `drop`, `finalize`
- Trait implementations: `clone`, `fmt`, `eq`, `hash`, `serialize`
- Event handlers: `on*`, `handle*`
- Private functions: `_*`
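
As an illustration, the exclusion list above could be expressed as a single predicate. The function name `is_special_function` is hypothetical, and the sketch takes the wildcard patterns literally as prefix matches.

```rust
/// Returns true when `name` matches one of the exclusion rules listed above.
fn is_special_function(name: &str) -> bool {
    // Entry points, lifecycle hooks, and common trait implementations.
    const EXACT: &[&str] = &[
        "main", "init", "__init__", "setup", "teardown", "drop", "finalize",
        "clone", "fmt", "eq", "hash", "serialize",
    ];
    // Test functions, event handlers, and private functions.
    // Note: a literal reading of `on*` also matches names such as `once`.
    const PREFIXES: &[&str] = &["test_", "Test", "on", "handle", "_"];

    EXACT.contains(&name) || PREFIXES.iter().any(|prefix| name.starts_with(*prefix))
}

fn main() {
    for name in ["main", "test_parse", "on_click", "_helper", "compute"] {
        println!("{name}: excluded = {}", is_special_function(name));
    }
}
```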
### 4. Code Complexity Analysis
**Metrics Calculated:**
- **Cyclomatic Complexity**: Number of linearly independent paths
- **Cognitive Complexity**: Measure of code understandability
- **Nesting Depth**: Maximum nesting level
- **Function Count**: Total functions per file
- **Line Count**: Total lines of code
**Thresholds:**
- Low: < 10
- Medium: 10-20
- High: > 20
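
A rough sketch of how a score might be computed and mapped onto these thresholds. It assumes the thresholds apply to cyclomatic complexity and approximates that metric by counting decision keywords and short-circuit operators in the text; `approx_cyclomatic`, `classify`, and `Level` are hypothetical names, not the tool's API.

```rust
#[derive(Debug, PartialEq)]
enum Level {
    Low,
    Medium,
    High,
}

/// Maps a complexity score onto the thresholds above.
fn classify(score: u32) -> Level {
    match score {
        0..=9 => Level::Low,
        10..=20 => Level::Medium,
        _ => Level::High,
    }
}

/// Approximates cyclomatic complexity as 1 + number of decision points.
/// Purely textual: it also counts keywords inside strings and comments,
/// and `||` in a Rust closure would be miscounted.
fn approx_cyclomatic(source: &str) -> u32 {
    let keywords = ["if", "while", "for", "case", "catch"];
    let branches = source
        .split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|token| keywords.contains(token))
        .count();
    let short_circuits = source.matches("&&").count() + source.matches("||").count();
    1 + (branches + short_circuits) as u32
}

fn main() {
    let source = "fn f(x: i32) -> i32 { if x > 0 && x < 10 { 1 } else { 0 } }";
    let score = approx_cyclomatic(source);
    println!("score = {score}, level = {:?}", classify(score));
}
```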
### 5. Comprehensive Code Metrics
**Halstead Metrics (11 sub-metrics):**
- n1: Number of distinct operators
- n2: Number of distinct operands
- N1: Total operator count
- N2: Total operand count
- Program length: N = N1 + N2
- Vocabulary: n = n1 + n2
- Volume: V = N × log2(n)
- Difficulty: D = (n1/2) × (N2/n2)
- Effort: E = D × V
- Time: T = E / 18
- Bugs: B = E^(2/3) / 3000
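
The derived values follow mechanically from the four base counts. The sketch below applies the formulas exactly as listed; the struct and method names are illustrative, and it assumes the counts have already been extracted and that `distinct_operands` is non-zero.

```rust
/// The four base counts: n1, n2, N1, N2.
struct HalsteadCounts {
    distinct_operators: u32, // n1
    distinct_operands: u32,  // n2
    total_operators: u32,    // N1
    total_operands: u32,     // N2
}

impl HalsteadCounts {
    /// N = N1 + N2
    fn length(&self) -> f64 {
        f64::from(self.total_operators + self.total_operands)
    }
    /// n = n1 + n2
    fn vocabulary(&self) -> f64 {
        f64::from(self.distinct_operators + self.distinct_operands)
    }
    /// V = N * log2(n)
    fn volume(&self) -> f64 {
        self.length() * self.vocabulary().log2()
    }
    /// D = (n1 / 2) * (N2 / n2)
    fn difficulty(&self) -> f64 {
        (f64::from(self.distinct_operators) / 2.0)
            * (f64::from(self.total_operands) / f64::from(self.distinct_operands))
    }
    /// E = D * V
    fn effort(&self) -> f64 {
        self.difficulty() * self.volume()
    }
    /// T = E / 18, in seconds
    fn time_seconds(&self) -> f64 {
        self.effort() / 18.0
    }
    /// B = E^(2/3) / 3000
    fn bugs(&self) -> f64 {
        self.effort().powf(2.0 / 3.0) / 3000.0
    }
}

fn main() {
    let h = HalsteadCounts {
        distinct_operators: 4,
        distinct_operands: 4,
        total_operators: 8,
        total_operands: 8,
    };
    println!("V = {:.2}, D = {:.2}, E = {:.2}", h.volume(), h.difficulty(), h.effort());
    println!("T = {:.2} s, B = {:.4}", h.time_seconds(), h.bugs());
}
```

With n1 = n2 = 4 and N1 = N2 = 8, this gives N = 16, n = 8, V = 16 × 3 = 48, D = 2 × 2 = 4, and E = 192.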
**Additional Metrics:**
- Essential Complexity
- NPath Complexity
- Lines of Code (LOC, SLOC, LLOC)
- Code Density & Comment Ratio
- Maintainability Index (MI)
- Code Churn
- Depth of Inheritance Tree (DIT)
- Coupling Between Objects (CBO)
- Lack of Cohesion in Methods (LCOM)
### 6. Design Metrics
**Metrics Calculated:**
- **Afferent Coupling (Ca)**: Number of incoming dependencies
- **Efferent Coupling (Ce)**: Number of outgoing dependencies
- **Instability (I)**: Ce / (Ca + Ce)
- **Abstractness (A)**: Number of abstract classes / total classes
- **Distance from Main Sequence (D)**: |A + I - 1|
- **Package Cohesion (LCOM)**: Lack of cohesion in methods
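
A small worked sketch of the three ratio-based formulas above. The function names are hypothetical, and returning 0.0 for a module with no dependencies or no classes is an assumption made here to avoid dividing by zero.

```rust
/// I = Ce / (Ca + Ce); 0.0 when the module has no couplings at all (assumption).
fn instability(afferent: usize, efferent: usize) -> f64 {
    let total = afferent + efferent;
    if total == 0 {
        0.0
    } else {
        efferent as f64 / total as f64
    }
}

/// A = abstract classes / total classes; 0.0 when there are no classes (assumption).
fn abstractness(abstract_classes: usize, total_classes: usize) -> f64 {
    if total_classes == 0 {
        0.0
    } else {
        abstract_classes as f64 / total_classes as f64
    }
}

/// D = |A + I - 1|
fn distance_from_main_sequence(a: f64, i: f64) -> f64 {
    (a + i - 1.0).abs()
}

fn main() {
    // A module with 3 incoming and 1 outgoing dependency, 1 abstract class out of 4.
    let i = instability(3, 1);
    let a = abstractness(1, 4);
    let d = distance_from_main_sequence(a, i);
    println!("I = {i:.2}, A = {a:.2}, D = {d:.2}");
}
```

For Ca = 3 and Ce = 1, I = 1/4 = 0.25; with one abstract class out of four, A = 0.25; so D = |0.25 + 0.25 - 1| = 0.5.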
### 7. Code Duplication Detection
**Algorithm:**
- Extracts code blocks (minimum configurable lines)
- Calculates string similarity using normalized edit distance
- Configurable similarity threshold (default: 0.9)
- Reports file pairs with similar blocks
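
The similarity step can be sketched with a plain Levenshtein distance normalized by the longer block's length. This is a minimal illustration of normalized edit distance, not the tool's implementation; `levenshtein` and `similarity` are hypothetical names.

```rust
/// Classic two-row dynamic-programming Levenshtein distance over characters.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution
                .min(prev[j + 1] + 1) // deletion
                .min(curr[j] + 1); // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Similarity in [0, 1]: 1 - distance / length of the longer input.
fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0; // two empty blocks are treated as identical
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}

fn main() {
    let threshold = 0.9; // the default similarity threshold stated above
    let block_a = "let total = a + b;";
    let block_b = "let total = a + c;";
    let score = similarity(block_a, block_b);
    println!("similarity = {score:.3}, duplicate = {}", score >= threshold);
}
```

Edit distance is quadratic in block length, so comparing every pair of blocks grows quickly on large codebases; a real detector would likely pre-filter candidates first.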
### 8. Graph Analysis
**Supported Graph Types:**
- **AST**: Abstract Syntax Tree with syntax edges
- **CFG**: Control Flow Graph with basic blocks and branches
- **DFG**: Data Flow Graph tracking variable usage
- **PDG**: Program Dependence Graph combining control dependencies (from the CFG) and data dependencies (from the DFG)
- **Call Graph**: Function call relationships
- **Dependency Graph**: Module and file dependencies
- **Unified Graph**: Combined AST + CFG + DFG in one structure
**Output Formats:**
- Text (human-readable)
- DOT (Graphviz visualization)
- JSON (structured data)
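
To show what DOT export involves, here is a minimal sketch that serializes a list of call-graph edges into Graphviz syntax. The `to_dot` function and the edge representation are hypothetical, and the sketch assumes node names contain no characters that need escaping.

```rust
/// Renders directed edges as a Graphviz `digraph`.
fn to_dot(graph_name: &str, edges: &[(&str, &str)]) -> String {
    let mut out = format!("digraph {graph_name} {{\n");
    for (from, to) in edges {
        // Quote node names so identifiers such as `Foo::bar` stay valid DOT.
        out.push_str(&format!("    \"{from}\" -> \"{to}\";\n"));
    }
    out.push_str("}\n");
    out
}

fn main() {
    let edges = [("main", "run"), ("run", "search"), ("run", "report")];
    print!("{}", to_dot("calls", &edges));
}
```

The resulting text can be rendered with the standard Graphviz command `dot -Tsvg`.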
### 9. Symbol Finding
**Find Types:**
- **definition**: Find symbol definitions
- **references**: Find all references to a symbol
- **callers**: Find all callers of a function
**Features:**
- Multi-language support
- Cross-file analysis
- JSON output for automation
### 10. Health Scoring
**Components:**
- Dead code detection
- Code duplication
- Complexity analysis
**Output:**
- Overall health score (0-100)
- Individual component scores
- Structured JSON for CI/CD integration
- `--fail-under` threshold support
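
One way the overall score and the `--fail-under` gate could fit together is sketched below. Equal weighting of the components is an assumption (the specification does not state the weights), although it does reproduce the sample report later in this document, where 90, 95, and 70 yield 85. The function names and the exit code value of 1 are hypothetical.

```rust
/// Overall score as the rounded mean of the component scores (assumed equal weights).
fn overall_score(components: &[u32]) -> u32 {
    if components.is_empty() {
        return 0;
    }
    let sum: u32 = components.iter().sum();
    (f64::from(sum) / components.len() as f64).round() as u32
}

/// Process exit code for CI: non-zero when the score is below the threshold.
fn exit_code(score: u32, fail_under: Option<u32>) -> i32 {
    match fail_under {
        Some(threshold) if score < threshold => 1,
        _ => 0,
    }
}

fn main() {
    // Dead code, duplicates, complexity.
    let score = overall_score(&[90, 95, 70]);
    println!("Overall Health Score: {score}/100");
    println!("exit code with --fail-under 90: {}", exit_code(score, Some(90)));
}
```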
### 11. Security Pattern Scanning
**Patterns Detected:**
- `eval()` usage
- `exec()` calls
- SQL injection patterns
- Command injection patterns
- Insecure deserialization
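
For the two call-based patterns, a line scanner along the following lines would be enough to get started. This is a deliberately naive sketch: `scan` is a hypothetical name, only `eval(` and `exec(` are covered, and matches inside strings or comments are still reported.

```rust
/// Returns (1-based line number, pattern) for each line that calls `eval(` or `exec(`.
fn scan(source: &str) -> Vec<(usize, &'static str)> {
    const PATTERNS: &[&str] = &["eval(", "exec("];
    let mut findings = Vec::new();
    for (index, line) in source.lines().enumerate() {
        for &pattern in PATTERNS {
            for (pos, _) in line.match_indices(pattern) {
                // Require a word boundary so `retrieval(` is not flagged as `eval(`.
                let at_boundary = line[..pos]
                    .chars()
                    .next_back()
                    .map_or(true, |c| !(c.is_alphanumeric() || c == '_'));
                if at_boundary {
                    findings.push((index + 1, pattern));
                    break; // one finding per pattern per line
                }
            }
        }
    }
    findings
}

fn main() {
    let source = "x = eval(user_input)\ny = retrieval(query)\nos.exec(cmd)\n";
    for (line, pattern) in scan(source) {
        println!("L{line}: found `{pattern}`");
    }
}
```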
### 12. MCP Server Integration
**Protocol**: Model Context Protocol (MCP)
**Transport**: stdio
**Version**: rmcp 0.12
**Exposed Tools (9 total):**
1. `search` - Pattern search with filters
2. `list` - Directory enumeration
3. `analyze` - Codebase metrics and statistics
4. `complexity` - Complexity analysis
5. `duplicates` - Duplication detection
6. `deadcode` - Dead code analysis
7. `circular` - Circular dependency detection
8. `find_symbol` - Symbol finding (definition, references, callers)
9. `get_health` - Health scoring
**Tool Schemas:**
- JSON schema generation with `schemars`
- Automatic parameter validation
- Structured response formatting
## Architecture
### Module Organization
```
codesearch/
├── src/
│ ├── lib.rs # Library exports
│ ├── main.rs # CLI entry point
│ ├── cli.rs # CLI definitions
│ │
│ ├── commands/ # Command handlers
│ │ ├── mod.rs
│ │ ├── search.rs
│ │ ├── analysis.rs
│ │ ├── graph.rs
│ │ └── util.rs
│ │
│ ├── search/ # Search functionality
│ │ ├── mod.rs
│ │ ├── core.rs
│ │ ├── fuzzy.rs
│ │ ├── semantic.rs
│ │ ├── utilities.rs
│ │ ├── engine.rs
│ │ └── pure.rs
│ │
│ ├── deadcode/ # Dead code detection
│ │ ├── mod.rs
│ │ ├── detectors.rs
│ │ ├── helpers.rs
│ │ └── types.rs
│ │
│ ├── duplicates/ # Duplication detection
│ ├── circular/ # Circular dependencies
│ ├── codemetrics/ # Code metrics
│ ├── designmetrics/ # Design metrics
│ │
│ ├── graphs.rs # Graph analysis interface
│ ├── ast.rs # AST
│ ├── cfg.rs # CFG
│ ├── dfg.rs # DFG
│ ├── pdg.rs # PDG
│ ├── callgraph.rs # Call graph
│ ├── depgraph.rs # Dependency graph
│ ├── unified.rs # Unified graph
│ │
│ ├── find.rs # Symbol finding
│ ├── health.rs # Health scoring
│ ├── security.rs # Security scanning
│ │
│ ├── language/ # Language support
│ ├── parser/ # Code parsers
│ ├── extract/ # Code extraction
│ │
│ ├── cache.rs # Simple cache
│ ├── cache_lru.rs # LRU cache
│ ├── index.rs # Code indexing
│ ├── watcher.rs # File watching
│ │
│ ├── githistory.rs # Git history
│ ├── remote.rs # Remote search
│ │
│ ├── export.rs # Export functionality
│ ├── interactive.rs # REPL mode
│ │
│ ├── mcp/ # MCP integration
│ │ ├── mod.rs
│ │ ├── tools.rs
│ │ ├── schemas.rs
│ │ └── params.rs
│ │
│ ├── types.rs # Shared types
│ ├── traits.rs # Core traits
│ ├── errors.rs # Custom errors
│ └── fs.rs # File system abstraction
│
├── tests/
│ ├── integration_e2e.rs # Integration tests
│ ├── cross_file_tests.rs # Cross-file tests
│ └── fixtures/ # Test fixtures
│
└── benches/
    ├── search_benchmark.rs # Search benchmarks
    └── parser_benchmarks.rs # Parser benchmarks
```
**Total**: 6,000+ lines of Rust code across 40+ modules
### Data Structures
```rust
// Search options (parameter object pattern)
pub struct SearchOptions {
    pub extensions: Option<Vec<String>>,
    pub ignore_case: bool,
    pub fuzzy: bool,
    pub fuzzy_threshold: f64,
    pub max_results: usize,
    pub exclude: Option<Vec<String>>,
    pub rank: bool,
    pub cache: bool,
    pub semantic: bool,
    pub benchmark: bool,
    pub vs_grep: bool,
}

// Dead code detection result
pub struct DeadCodeItem {
    pub file: String,
    pub line_number: usize,
    pub item_type: String,
    pub name: String,
    pub reason: String,
}

// Search result
pub struct SearchResult {
    pub file: String,
    pub line_number: usize,
    pub content: String,
    pub matches: Vec<Match>,
    pub score: f64,
    pub relevance: String,
}

// Complexity metrics
pub struct ComplexityMetrics {
    pub file_path: String,
    pub cyclomatic_complexity: u32,
    pub cognitive_complexity: u32,
    pub lines_of_code: usize,
    pub function_count: usize,
    pub max_nesting_depth: u32,
}

// Design metrics
pub struct ModuleMetrics {
    pub name: String,
    pub afferent_coupling: usize, // Ca
    pub efferent_coupling: usize, // Ce
    pub instability: f64,         // I
    pub abstractness: f64,        // A
    pub distance: f64,            // D
    pub cohesion: f64,            // LCOM
}
```
## Performance Characteristics
### Optimization Strategies
1. **Parallel Processing**
- Uses rayon for multi-threaded file processing
- Scales to available CPU cores
- Thread-safe operations throughout
2. **Caching**
- LRU cache with automatic eviction
- Query-based caching with file modification tracking
- Thread-safe DashMap implementation
- Configurable cache capacity
3. **Memory Efficiency**
- Streaming file reading (no full file loads)
- Efficient data structures (DashMap, ahash)
- Lazy evaluation where possible
4. **Regex Optimization**
- Compiled patterns cached
- Reused across file processing
- Regex compilation outside loops
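
The caching strategy can be illustrated with a small thread-safe LRU cache. Note the substitution: this sketch uses a standard-library `Mutex` around a `HashMap` and a `VecDeque` instead of DashMap so that it stays dependency-free. The type and method names are hypothetical, and file-modification tracking is left out.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Mutex;

struct Inner {
    map: HashMap<String, Vec<String>>,
    order: VecDeque<String>, // front = least recently used
}

/// Query -> results cache that evicts the least recently used entry when full.
struct LruCache {
    capacity: usize,
    inner: Mutex<Inner>,
}

impl LruCache {
    fn new(capacity: usize) -> Self {
        let inner = Inner { map: HashMap::new(), order: VecDeque::new() };
        Self { capacity, inner: Mutex::new(inner) }
    }

    fn get(&self, key: &str) -> Option<Vec<String>> {
        let mut guard = self.inner.lock().unwrap();
        let inner = &mut *guard;
        let value = inner.map.get(key).cloned()?;
        touch(&mut inner.order, key); // a hit makes the key most recently used
        Some(value)
    }

    fn put(&self, key: &str, value: Vec<String>) {
        let mut guard = self.inner.lock().unwrap();
        let inner = &mut *guard;
        inner.map.insert(key.to_string(), value);
        touch(&mut inner.order, key);
        // Automatic eviction: drop least recently used entries beyond capacity.
        while inner.map.len() > self.capacity {
            match inner.order.pop_front() {
                Some(oldest) => {
                    inner.map.remove(&oldest);
                }
                None => break,
            }
        }
    }
}

/// Moves `key` to the most-recently-used end of the queue.
fn touch(order: &mut VecDeque<String>, key: &str) {
    order.retain(|k| k.as_str() != key);
    order.push_back(key.to_string());
}

fn main() {
    let cache = LruCache::new(2);
    cache.put("fn main", vec!["src/main.rs:1".to_string()]);
    cache.put("struct", vec!["src/types.rs:3".to_string()]);
    cache.get("fn main"); // refreshes "fn main"
    cache.put("impl", vec![]); // evicts "struct"
    println!("struct cached: {}", cache.get("struct").is_some());
}
```

The linear `retain` in `touch` keeps the sketch short; a production cache would use a linked structure for constant-time reordering.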
### Performance Targets
- **Search Latency**: < 50ms for typical queries (< 1000 files)
- **Memory Usage**: < 100MB for moderate codebases (< 10K files)
- **Parallel Efficiency**: 70%+ CPU utilization on multi-core systems
- **Cache Hit Rate**: 70-90% for repeated searches
*Note: Actual performance depends on codebase size, hardware, and query complexity.*
## Testing Strategy
### Test Coverage (173 unit + 36 integration + 23 MCP = 232 tests)
**Unit Tests (173 tests):**
- Co-located with implementation code
- Test individual functions in isolation
- Use temporary directories for file operations
- Pure function testing
**Integration Tests (36 tests):**
- End-to-end CLI command testing
- Output format validation
- Error handling verification
- Cross-file analysis testing
**MCP Tests (23 tests):**
- Tool invocation testing
- Parameter validation
- Response format verification
**Property-Based Tests:**
- `proptest` for fuzzing
- Test invariants
- Generate random inputs
### Test Execution
```bash
# All tests
cargo test --features mcp
# Specific module
cargo test deadcode --lib
# With output
cargo test -- --nocapture
# Run benchmarks
cargo bench
# Generate coverage
cargo tarpaulin --out Html
```
## CLI Interface
### Commands
```bash
# Search
codesearch "<query>" [path] [options]
codesearch interactive
# Analysis
codesearch analyze [path] [--metrics]
codesearch complexity [path] [--threshold N] [--sort]
codesearch deadcode [path] [-e extensions] [--exclude dirs]
codesearch duplicates [path] [--min-lines N] [--similarity N]
codesearch circular [path] [-e extensions]
# Graph
# Health
codesearch health [path] [--fail-under N]
# Security
codesearch security [path] [--extensions]
# Utilities
codesearch files [path] [--extensions]
codesearch languages
codesearch index [path]
codesearch watch [path]
codesearch git-history <query> [path]
codesearch remote --github <query> owner/repo
# MCP Server
codesearch mcp
```
### Options
- `-e, --extensions`: Filter by file extensions (comma-separated)
- `-x, --exclude`: Exclude directories/patterns (comma-separated)
- `-f, --fuzzy`: Enable fuzzy matching
- `-r, --regex`: Enable regex mode
- `-i, --ignore-case`: Case-insensitive search
- `--case-sensitive`: Case-sensitive search
- `--rank`: Rank results by relevance
- `--format`: Output format (text, csv, markdown, json, dot)
- `-o, --output`: Output file path
- `--threshold`: Complexity/similarity threshold
- `--sort`: Sort results
- `--cache`: Enable caching
- `--semantic`: Enable semantic search
- `--benchmark`: Benchmark mode
- `--fail-under`: Fail if score below threshold
## Output Formats
### Dead Code Detection Output
```
🔍 Dead Code Detection
─────────────────────────────
Found 12 potential dead code items:
[src/example.rs]
[var] L 10: variable 'unused_var' - Variable declared but never used
[!] L 25: unreachable - Code after return statement is unreachable
[∅] L 42: empty_helper - Empty function with no implementation
[?] L 58: // TODO: implement this - TODO marker
[imp] L 72: import 'HashMap' - Imported but never used
📊 Summary:
• variable: 3
• unreachable: 2
• empty: 2
• todo: 3
• import: 2
```
### Health Score Output
```
🏥 Code Health Report
─────────────────────────────
Overall Health Score: 85/100 ✅
Components:
• Dead Code: 90/100 (3 issues)
• Duplicates: 95/100 (2 duplicates)
• Complexity: 70/100 (5 high-complexity functions)
Recommendations:
1. Review high-complexity functions in src/auth.rs
2. Remove duplicate code in src/utils.rs
3. Clean up unused variables in src/main.rs
```
## Dependencies
### Production Dependencies
- clap 4.4 - CLI parsing
- regex 1.10 - Pattern matching
- walkdir 2.4 - Directory traversal
- serde 1.0 - Serialization
- serde_json 1.0 - JSON serialization
- colored 2.1 - Terminal colors
- rayon 1.8 - Parallel processing
- dashmap 5.5 - Thread-safe maps
- ahash 0.8 - Fast hashing
- fuzzy-matcher 0.3 - Fuzzy search
- thiserror 1.0 - Custom error types
- anyhow 1.0 - Error propagation
### Optional Dependencies (MCP)
- rmcp 0.12 - MCP protocol
- tokio 1.0 - Async runtime
- schemars 1.2 - JSON schema
### Development Dependencies
- tempfile 3.8 - Temporary files for tests
- proptest 1.4 - Property-based testing
- criterion 0.5 - Benchmarking
## Build Configuration
```toml
[package]
name = "codesearch"
version = "0.1.8"
edition = "2024"
license = "Apache-2.0"
[features]
default = []
mcp = ["rmcp", "tokio", "schemars"]
[dependencies]
# ... dependencies listed above
```
**Rust Edition**: 2024
**MSRV**: Rust 1.85+ (the first stable release that supports Edition 2024)
**Target**: Native binary (CLI-only, no WASM)
## Quality Standards
### Code Quality
- ✅ **100% test pass rate** (232 tests)
- ✅ **Zero clippy warnings**
- ✅ **Modular architecture** (40+ focused modules)
- ✅ **Thread-safe** parallel processing
- ✅ **Comprehensive error handling**
### Maintainability
- Trait abstractions for extensibility
- Parameter object pattern
- Dependency injection for testability
- Clear separation of concerns
### Performance
- Fast: targets < 50ms for typical searches (< 1000 files)
- Parallel: Auto-scales to available CPU cores
- Smart caching: LRU with automatic eviction
- Memory efficient: Streaming file reading
## Future Enhancements
### Planned Features
- AST-based code analysis for more languages
- Incremental indexing for very large codebases
- Enhanced git history search
- Plugin system for custom analyzers
- Web UI for visualization
- ML-based code pattern recognition
### Performance Improvements
- File watching for real-time updates
- Optimized memory usage for large files
- AST caching for frequently accessed files
- Query warming for common searches
## Version History
### 0.1.8 (Current)
- Comprehensive code metrics (Halstead, maintainability, etc.)
- Design metrics (coupling, cohesion, instability)
- Enhanced dead code detection (6+ types)
- MCP server with 9 tools
- Health scoring with CI/CD integration
- Symbol finding (definition, references, callers)
- Security pattern scanning
- Graph analysis (7 graph types)
- Trait abstractions for testability
- LRU cache with automatic eviction
### 0.1.7
- Modular architecture (40+ modules)
- Command handlers extracted from main.rs
- Search engine refactored with traits
- Dependency injection for file system
- Custom error types
- Parameter object pattern
### 0.1.6
- Graph analysis (AST, CFG, DFG, PDG, Call, Dependency, Unified)
- DOT format export
- Find symbol command
- Health scoring
### 0.1.5
- Design metrics module
- Comprehensive code metrics
- Dead code detection enhancements
- Property-based tests
- Benchmark suite
### 0.1.4
- Enhanced dead code detection
- 11 new unit tests
- Updated documentation
### 0.1.3
- MCP server support
- 48 language support
- Complexity metrics
- Code duplication detection
### 0.1.2
- Interactive mode
- Fuzzy search
- Export functionality
### 0.1.1
- Basic search functionality
- Regex support
- Multi-extension filtering
## License
Apache-2.0 License
---