guardy 0.2.4 - Docs.rs

# Scanner Module

Comprehensive secret detection and file content analysis using pattern matching, entropy analysis,
and intelligent filtering. Detects 40+ types of secrets including private keys, API tokens, database
credentials, and more.

## Architecture

```
src/scanner/
├── mod.rs           # Module routing and re-exports only
├── core.rs          # Main Scanner struct and scanning logic
├── directory.rs     # DirectoryHandler and parallel coordination
├── patterns.rs      # Secret pattern definitions and regex compilation
├── entropy.rs       # Statistical entropy analysis algorithms
├── types.rs         # Core types (ScanResult, ScanStats, etc.)
├── test_detection.rs # Intelligent test code block detection
└── README.md        # This documentation
```

## Files and Responsibilities

### `core.rs`

- **Purpose**: Core scanning logic and individual file processing
- **Contains**: `Scanner` struct, individual file scan methods, pattern matching orchestration
- **Tests**: Scanner creation, single file scanning, pattern matching accuracy

### `directory.rs`

- **Purpose**: Directory scanning coordination and parallel execution
- **Contains**: `DirectoryHandler`, worker adaptation, execution strategy coordination, gitignore
  analysis
- **Tests**: Directory filtering, parallel execution, worker adaptation strategies

### `patterns.rs`

- **Purpose**: Secret pattern definitions and regex management
- **Contains**: `SecretPatterns`, `SecretPattern`, 40+ predefined patterns for comprehensive secret
  detection
- **Built-in Detection**: Private keys (SSH, PGP, RSA, etc.), API keys (OpenAI, GitHub, AWS, etc.),
  database credentials, JWT tokens
- **Tests**: Pattern compilation, pattern matching, coverage of AI/cloud service patterns

### `entropy.rs`

- **Purpose**: Statistical analysis for randomness detection
- **Contains**: `is_likely_secret()` function, entropy calculation algorithms
- **Tests**: Entropy analysis accuracy, threshold validation, realistic vs fake secrets

### `types.rs`

- **Purpose**: Core data structures and type definitions
- **Contains**: `ScanResult`, `ScanStats`, `ScanMode`, `SecretMatch`, `Warning`, etc.
- **Tests**: Type serialization, result aggregation, statistics calculation

### `test_detection.rs`

- **Purpose**: Intelligent test code block detection across multiple languages
- **Contains**: `TestDetector`, block boundary detection, language-specific parsing
- **Tests**: Rust test blocks, TypeScript/JavaScript test suites, Python test functions

### `mod.rs`

- **Purpose**: Module organization only
- **Contains**: Module declarations and re-exports
- **Tests**: None (routing only)

## Test Organization Guidelines

**✅ DO:**

- Put tests inline with `#[cfg(test)] mod tests` in each implementation file
- Test the specific functionality in the same file where it's implemented
- Keep scanner tests in `core.rs`, pattern tests in `patterns.rs`, etc. **❌ DON'T:**
- Put tests in `mod.rs` (routing only)
- Create separate `tests.rs` files (use inline tests)
- Mix tests from different components in one file

## Data Flow

```
Scanner (core.rs)
    ↓
File Reading → Line Scanning → Pattern Matching (patterns.rs)
    ↓                              ↓
SecretMatch ← Entropy Analysis (entropy.rs)
    ↓
ScanResult with Statistics
```

## Scanner Ignore Mechanisms

The scanner provides four intelligent ignore mechanisms to prevent false positives:

### 1. **Path-based Ignoring** (`ignore_paths`)

Uses glob patterns to ignore entire files and directories:

```toml
[scanner]
ignore_paths = [
  "tests/*", # All test directories
  "testdata/*", # Test data directories
  "*_test.rs", # Test files
  "test_*.rs", # Test files
]
```

### 2. **Pattern-based Ignoring** (`ignore_patterns`)

Ignores lines containing specific patterns:

```toml
[scanner]
ignore_patterns = [
  "# TEST_SECRET:", # Lines marked as test secrets
  "DEMO_KEY_", # Demo/fake keys
  "FAKE_", # Fake credentials
]
```

### 3. **Comment-based Ignoring** (`ignore_comments`)

Inline comments to suppress scanning:

```toml
[scanner]
ignore_comments = [
  "guardy:ignore", # Ignore this line
  "guardy:ignore-line", # Ignore this line
  "guardy:ignore-next", # Ignore next line
]
```

**Usage:**

```rust
let secret = "sk_live_real_key"; // guardy:ignore-line
// guardy:ignore-next
let another_secret = "sk_test_fake_key";
```

### 4. **Intelligent Test Code Detection** (`ignore_test_code`)

Automatically detects and ignores test code across multiple languages:

```toml
[scanner]
ignore_test_code = true
test_attributes = [
  # Rust test patterns
  "#[*test]", # Matches #[test], #[tokio::test], etc.
  "#[bench]", # Benchmark functions
  "#[cfg(test)]", # Test configuration
  # Python test patterns
  "def test_*", # Test functions
  "class Test*", # Test classes
  "@pytest.*", # Pytest decorators
  # TypeScript/JavaScript test patterns
  "it(*", # Jest/Mocha it() blocks
  "test(*", # Jest test() blocks
  "describe(*", # Jest/Mocha describe() blocks
]
test_modules = [
  # Rust
  "mod tests {", # Test modules
  "mod test {", # Test modules
  # Python
  "class Test", # Test classes
  # TypeScript/JavaScript
  "describe(", # Test suites
  "__tests__", # Test directories
]
```

**Detected patterns by language:** **Rust:**

- `#[test]`, `#[tokio::test]`, `#[async_test]`, `#[wasm_bindgen_test]`
- `#[bench]` benchmark functions
- `#[cfg(test)]` conditional compilation
- `mod tests {` and `mod test {` test modules **Python:**
- `def test_*` test functions
- `class Test*` test classes
- `@pytest.*` pytest decorators
- `class Test` test class declarations **TypeScript/JavaScript:**
- `it(` Jest/Mocha test cases
- `test(` Jest test cases
- `describe(` Jest/Mocha test suites

## Configuration

All ignore mechanisms are configurable via `guardy.toml`:

```toml
[scanner]
# Enable/disable each mechanism
ignore_test_code = true
# Customize patterns for your project
ignore_patterns = [
  "# DEMO:",
  "EXAMPLE_",
  "YOUR_CUSTOM_PATTERN",
]
# Add custom test attributes
test_attributes = [
  "#[*test]",
  "#[custom::test]",
]
```

## Integration with Other Modules

- **Config**: Gets scanner configuration and pattern customization
- **Git**: Integrates with git file discovery for targeted scanning
- **CLI**: Provides scan results for command-line output
- **MCP**: Exposes scanning capabilities via MCP server interface
- **Parallel**: Coordinates parallel execution strategies and resource management

### Parallel Module Integration

The scanner module integrates tightly with the parallel module for efficient file processing:

#### Execution Strategies

- **Sequential**: Single-threaded scanning for small workloads
- **Parallel**: Multi-threaded scanning with domain-adapted worker counts
- **Auto**: Threshold-based automatic strategy selection

#### Resource Management Flow

```text
1. Scanner Config       →  2. Resource Calculation      →  3. Domain Adaptation
   ┌─────────────────┐      ┌──────────────────────────┐     ┌─────────────────────┐
   │ • max_threads   │      │ CPU cores: 16            │     │ File count: 36      │
   │ • thread_%: 75% │  ──▶ │ 16 * 75% = 12 workers    │ ──▶ │ ≤50 → 12/2 = 6     │
   │ • mode: auto    │      │ (system resource limit)  │     │ (domain adaptation)  │
   └─────────────────┘      └──────────────────────────┘     └─────────────────────┘
                                                                       │
4. Strategy Decision                          ← ← ← ← ← ← ← ← ← ← ← ← ← ←
   ┌─────────────────────────────────────────┐
   │ auto(file_count=36, threshold=50, workers=6) │
   │ → 36 < 50 → ExecutionStrategy::Sequential    │
   └─────────────────────────────────────────────┘
```

#### Worker Adaptation Strategy

The scanner implements domain-specific worker adaptation in
`DirectoryHandler::adapt_workers_for_file_count()`:

- **≤10 files**: Minimal parallelism (overhead exceeds benefits)
- **≤50 files**: Conservative parallelism (50% of max workers)
- **≤100 files**: Moderate parallelism (75% of max workers)
- **>100 files**: Full parallelism (all available workers)

## Notes for AI Assistants and Developers

### 🤖 AI Assistant Guidelines

#### When Working with Scanner Module:

- **Use `DirectoryHandler::scan()`** as the primary entry point for directory scanning
- **Let the module handle strategy decisions** unless explicit override needed
- **Trust the domain adaptation logic** for worker scaling based on file counts
- **Respect the filtered directory patterns** for optimal performance

#### Key Integration Points:

1. **File Discovery**: Use built-in directory filtering and walking logic
2. **Parallel Coordination**: Integrate with parallel module for resource management
3. **Progress Reporting**: Use configured progress reporters with appropriate icons
4. **Result Aggregation**: Collect and combine scan results with statistics

#### Common Usage Patterns:

```rust
use guardy::scanner::directory::DirectoryHandler;
use guardy::scanner::Scanner;
use std::sync::Arc;
// Primary scanning workflow - uses global GUARDY_CONFIG
let scanner = Arc::new(Scanner::new()?);
let directory_handler = DirectoryHandler::default();
// Automatic strategy selection
let result = directory_handler.scan(scanner, path, None)?;
// Explicit strategy override
let strategy = ExecutionStrategy::Parallel { workers: 4 };
let result = directory_handler.scan(scanner, path, Some(strategy))?;
```

### 🔧 Development Guidelines

#### File Architecture Updates:

The current file structure reflects the parallel integration:

```
src/scanner/
├── mod.rs           # Module routing and re-exports
├── core.rs          # Main Scanner struct and scanning logic
├── directory.rs     # DirectoryHandler and parallel coordination
├── patterns.rs      # Secret pattern definitions and regex compilation
├── entropy.rs       # Statistical entropy analysis algorithms
├── types.rs         # Core types (ScanResult, ScanStats, etc.)
├── test_detection.rs # Intelligent test code block detection
└── README.md        # This documentation
```

#### Key Responsibilities by File:

##### `directory.rs` (New/Enhanced)

- **Purpose**: Directory scanning coordination and parallel execution
- **Contains**: `DirectoryHandler`, worker adaptation, execution strategy coordination
- **Integration**: Primary interface between scanner and parallel modules

##### `core.rs` (Updated)

- **Purpose**: Core scanning logic and file processing
- **Contains**: `Scanner` struct, individual file scanning methods
- **Focus**: Single-file processing, pattern matching orchestration

##### `types.rs` (Updated)

- **Purpose**: Core data structures and enums
- **Contains**: `ScanResult`, `ScanStats`, `ScanMode`, `Warning`, etc.
- **Usage**: Shared types across scanner modules

#### Adding New Features:

- **Directory Filtering**: Extend `DirectoryHandler::default()` with new patterns
- **Worker Adaptation**: Modify thresholds in `adapt_workers_for_file_count()`
- **Progress Reporting**: Customize icons and frequency in execution strategies
- **File Processing**: Add new scan methods to `Scanner` in `core.rs`

### 🎯 Performance Optimization

#### OS Cache Optimization:

- **Intelligent caching**: Leverages OS filesystem cache for dramatic performance improvements
- **Cold cache**: ~1,900 files/second initial scan performance
- **Warm cache**: ~5,200 files/second (2.7x improvement) on subsequent scans
- **Real-world benefits**: Perfect for CI/CD workflows and iterative development
- **Example**: 172,832 files scanned in 91s (cold) vs 33s (warm) - 63% faster!

#### Directory Filtering Impact:

- Reduces scan time by 60-80% by skipping build/cache directories
- Automatic gitignore analysis provides optimization suggestions
- Language-specific patterns (node_modules, target, **pycache**, etc.)

#### Parallel Execution Benefits:

- File-count-aware worker scaling
- Resource-aware execution strategy selection
- Automatic threshold-based parallel/sequential decisions

#### Memory Management:

- Arc<Scanner> enables thread-safe sharing across workers
- Bounded channels prevent memory overflow in large directories
- Progress reporting optimized for minimal contention
- Typical memory usage: <200MB for repositories with 100k+ files

### 🚨 Common Pitfalls to Avoid

1. **Don't bypass DirectoryHandler**: Use the coordinated scanning approach
2. **Don't hardcode execution strategies**: Let auto mode optimize for workload
3. **Don't ignore filtered directories**: They're essential for performance
4. **Don't mix scanning and parallel logic**: Keep separation of concerns

### 📊 Configuration Integration

#### Scanner-Specific Settings:

```toml
[scanner]
mode = "auto" # Sequential/Parallel/Auto
max_threads = 0 # 0 = no limit
thread_percentage = 75 # Use 75% of CPU cores
min_files_for_parallel = 50 # Threshold for auto mode
# Ignore mechanisms
ignore_test_code = true
ignore_paths = ["tests/*", "*_test.rs"]
ignore_patterns = ["# TEST_SECRET:", "DEMO_KEY_"]
ignore_comments = ["guardy:ignore", "guardy:ignore-line"]
```

#### Progress Reporting Configuration:

- **Sequential**: ⏳ icon, 10-item frequency
- **Parallel**: ⚡ icon, 5-item frequency
- **Custom**: Configurable via progress reporter factories

## Supported Secret Types

The scanner includes 40+ built-in patterns for comprehensive secret detection:

### Private Keys & Certificates

- SSH private keys (RSA, DSA, EC, OpenSSH, SSH2)
- PGP/GPG private keys (armored format)
- PKCS private keys (standard format)
- PuTTY private keys (all versions)
- Age encryption keys (modern file encryption)

### Cloud Provider Credentials

- **AWS**: Access keys, secret keys, session tokens
- **Azure**: Client secrets, storage keys
- **Google Cloud**: API keys, service account keys

### API Keys & Tokens

- **AI/ML**: OpenAI, Anthropic Claude, Hugging Face, Cohere, Replicate, Mistral
- **Development**: GitHub tokens, GitLab tokens, npm tokens
- **Services**: Slack tokens, SendGrid keys, Twilio credentials, Mailchimp keys, Stripe keys, Square
  tokens
- **JWT/JWE**: JSON Web Tokens

### Database Credentials

- MongoDB connection strings
- PostgreSQL connection strings
- MySQL connection strings

### Generic Detection

- **Context-based patterns**: High-entropy strings near keywords like "password", "token", "key",
  "secret", "api"
- **URL credentials**: `https://user:pass@host` patterns
- **Custom configurable patterns**: Add your own regex patterns via configuration

### Pattern Matching Strategy

1. **Specific patterns**: Known formats for popular services (high precision)
2. **Generic context patterns**: Detect unknown secrets using contextual keywords + high entropy
3. **Entropy analysis**: Statistical validation of randomness for suspected secrets
4. **Intelligent filtering**: Skip test code, demo data, and false positives

## Usage Examples

```rust
use crate::scanner::{Scanner, SecretPatterns};
// Create scanner (uses static CONFIG internally)
let scanner = Scanner::new()?;
// Scan individual file
let matches = scanner.scan_file(&path)?;
for secret_match in matches {
    println!("Found {} at {}:{}",
        secret_match.secret_type,
        secret_match.file_path,
        secret_match.line_number);
}
// Scan directory with full results
let result = scanner.scan_directory(&dir_path)?;
println!("Found {} secrets in {} files",
    result.stats.total_matches,
    result.stats.files_scanned);
// CLI usage examples
// guardy scan src/ --stats
// guardy scan config.json --include-binary
// guardy scan . --max-file-size 50
```