# Advanced Features
This section covers Debtmap's advanced analysis capabilities: purity detection, data flow analysis, entropy-based complexity, and context-aware analysis.
## Purity Detection
Debtmap detects pure functions - those without side effects that always return the same output for the same input.
**What makes a function pure:**
- No I/O operations (file, network, database)
- No mutable global state
- No random number generation
- No system calls
- Deterministic output
**Purity detection is optional:**
- Both `is_pure` and `purity_confidence` are `Option` types
- May be `None` for some functions or languages where detection is not available
- Rust has the most comprehensive purity detection support
**Four-level purity classification:**
The `PurityLevel` enum (`src/core/mod.rs:49-62`) provides more nuanced classification than the binary `is_pure`:
- **StrictlyPure**: No mutations whatsoever - pure mathematical functions
- **LocallyPure**: Uses local mutations for efficiency but no external side effects (builder patterns, accumulators, owned `mut self`)
- **ReadOnly**: Reads external state but doesn't modify it (constants, `&self` methods)
- **Impure**: Modifies external state or performs I/O (`&mut self`, statics, I/O)
This four-level classification enables better scoring for functions that use local mutations for efficiency but are functionally pure (referentially transparent). See [Complexity Metrics](complexity-metrics.md) for how purity affects scoring.
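To make the four levels concrete, the sketch below shows hand-written functions for each one. All names (`add`, `sum_squares`, `Counter`, `record_call`) are hypothetical, and the level named in each comment is what the definitions above suggest, not verified Debtmap output.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical global, used only to illustrate an impure function.
static CALLS: AtomicUsize = AtomicUsize::new(0);

const TAX_RATE: f64 = 0.2;

pub struct Counter {
    pub count: u32,
}

// StrictlyPure: no mutation at all.
pub fn add(a: i32, b: i32) -> i32 {
    a + b
}

// LocallyPure: mutates only a local accumulator.
pub fn sum_squares(values: &[i32]) -> i32 {
    let mut total = 0;
    for v in values {
        total += v * v;
    }
    total
}

// ReadOnly: reads a constant without modifying anything.
pub fn with_tax(price: f64) -> f64 {
    price * (1.0 + TAX_RATE)
}

impl Counter {
    // ReadOnly: reads state through `&self`.
    pub fn doubled(&self) -> u32 {
        self.count * 2
    }

    // Impure: mutates caller-visible state through `&mut self`.
    pub fn increment(&mut self) {
        self.count += 1;
    }
}

// Impure: modifies a static.
pub fn record_call() -> usize {
    CALLS.fetch_add(1, Ordering::SeqCst) + 1
}
```

Note that `sum_squares` and `add` are both referentially transparent; the difference is only whether a local `mut` binding appears in the body.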
**Confidence scoring (when available):**
- **0.9-1.0**: Very confident (no side effects detected)
- **0.7-0.8**: Likely pure (minimal suspicious patterns)
- **0.5-0.6**: Uncertain (some suspicious patterns)
- **0.0-0.4**: Likely impure (side effects detected)
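A consumer of these fields has to handle both the `None` case and the band boundaries. The following is a minimal sketch, assuming the confidence is an `Option<f32>` and that values falling between the documented ranges (for example 0.85) belong to the lower band; `confidence_band` is a hypothetical helper, not part of Debtmap.

```rust
/// Maps an optional purity confidence to the bands listed above.
/// Assumption: each band is closed at its lower bound, so a value such as
/// 0.85 (between two documented ranges) falls into "likely pure".
pub fn confidence_band(purity_confidence: Option<f32>) -> &'static str {
    match purity_confidence {
        None => "unknown (detection not available)",
        Some(c) if c >= 0.9 => "very confident",
        Some(c) if c >= 0.7 => "likely pure",
        Some(c) if c >= 0.5 => "uncertain",
        Some(_) => "likely impure",
    }
}
```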
**Example:**
```rust
// Pure: confidence = 0.95
fn calculate_total(items: &[Item]) -> f64 {
    items.iter().map(|i| i.price).sum()
}

// Impure: confidence = 0.1 (I/O detected)
fn save_total(items: &[Item]) -> Result<()> {
    // `sum` needs an explicit target type here
    let total: f64 = items.iter().map(|i| i.price).sum();
    write_to_file(total) // Side effect!
}
```
**Benefits:**
- Pure functions are easier to test
- Can be safely cached or memoized
- Safe to parallelize
- Easier to reason about
## Data Flow Analysis
Debtmap builds a comprehensive `DataFlowGraph` that extends basic call graph analysis with variable dependencies, data transformations, I/O operations, and purity tracking.
### Call Graph Foundation
**Upstream callers** - Who calls this function
- Indicates impact radius
- More callers = higher impact if it breaks
**Downstream callees** - What this function calls
- Indicates dependencies
- More callees = more integration testing needed
**Example:**
```json
{
  "name": "process_payment",
  "upstream_callers": [
    "handle_checkout",
    "process_subscription",
    "handle_refund"
  ],
  "downstream_callees": [
    "validate_payment_method",
    "calculate_fees",
    "record_transaction",
    "send_receipt"
  ]
}
```
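Upstream callers are the inverse of the callee relation, so they can be derived from a caller-to-callees map. The sketch below illustrates that inversion; `upstream_callers` and the map shape are assumptions for illustration and do not reflect how Debtmap stores its call graph.

```rust
use std::collections::BTreeMap;

/// Inverts a caller -> callees map into a callee -> callers map.
/// `BTreeMap` keeps the output ordering deterministic.
pub fn upstream_callers<'a>(
    callees: &BTreeMap<&'a str, Vec<&'a str>>,
) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut callers: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    for (caller, targets) in callees {
        for target in targets {
            // Record `caller` as an upstream caller of `target`.
            callers.entry(*target).or_default().push(*caller);
        }
    }
    callers
}
```

A function that never appears as a key in the result has no known callers, which usually marks it as an entry point or as dead code.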
### Variable Dependency Tracking
`DataFlowGraph` tracks which variables each function depends on (`src/data_flow/mod.rs:119`):
```rust
pub struct DataFlowGraph {
    // Maps function_id -> set of variable names used
    variable_deps: HashMap<FunctionId, HashSet<String>>,
    // ...
}
```
**What it tracks:**
- Function parameters (primary source via extraction adapters)
- Local variables accessed in function body
- Captured variables (closures)
**Note:** Variable dependency tracking stores variable *names* only (as `HashSet<String>`). It does not track mutability information - that analysis is handled separately by the purity detection system.
**Benefits:**
- Identify functions coupled through shared state
- Detect potential side effect chains
- Guide refactoring to reduce coupling
**Example output:**
```json
{
  "function": "calculate_total",
  "variable_dependencies": ["items", "tax_rate", "discount", "total"],
  "parameter_count": 3,
  "local_var_count": 1
}
```
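One way to use name-only dependency sets is to intersect them and look for functions that touch the same variables. The sketch below assumes a map shaped like `variable_deps`, with `String` standing in for `FunctionId`; `shared_variables` is a hypothetical helper.

```rust
use std::collections::{HashMap, HashSet};

/// Returns the variable names that two functions both depend on, sorted.
/// An unknown function id yields an empty result.
pub fn shared_variables(
    variable_deps: &HashMap<String, HashSet<String>>,
    a: &str,
    b: &str,
) -> Vec<String> {
    match (variable_deps.get(a), variable_deps.get(b)) {
        (Some(deps_a), Some(deps_b)) => {
            let mut shared: Vec<String> = deps_a.intersection(deps_b).cloned().collect();
            shared.sort();
            shared
        }
        _ => Vec::new(),
    }
}
```

Because only names are stored, a match means "same identifier", not necessarily "same storage"; two unrelated locals called `total` would also intersect.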
### Data Transformation Patterns
`DataFlowGraph` tracks data transformations between functions. The `TransformationType` enum (`src/organization/data_flow_analyzer.rs:35-46`) classifies transformations by their input/output cardinality:
```rust
pub enum TransformationType {
    Direct,        // A → B (pure transformation)
    Aggregation,   // (A, B) → C (multiple inputs to single output)
    Decomposition, // A → (B, C) (single input to multiple outputs)
    Enrichment,    // A → Result<B> (validation/enrichment with Result/Option)
    Expansion,     // A → Vec<B> (single input to collection)
}
```
**Classification logic** (`src/organization/data_flow_analyzer.rs:124-146`):
- Multiple input parameters → `Aggregation`
- Return type is `Result<T>` or `Option<T>` → `Enrichment`
- Return type is `Vec<T>` → `Expansion`
- Return type is tuple → `Decomposition`
- Default → `Direct`
**Example usage:**
```rust
// Aggregation: (items, discount_rate) → f64
fn calculate_discounted_total(items: &[Item], discount_rate: f64) -> f64 {
    items.iter().map(|i| i.price).sum::<f64>() * (1.0 - discount_rate)
}

// Enrichment: Config → Result<ValidatedConfig>
fn validate_config(config: Config) -> Result<ValidatedConfig> {
    // ...
}

// Expansion: Order → Vec<LineItem>
fn extract_line_items(order: &Order) -> Vec<LineItem> {
    order.items.clone()
}
```
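The classification rules listed earlier can be sketched as a single function. This version assumes the rules are checked in the order given and that the return type is available as a string; the `Signature` struct and `classify` function are hypothetical simplifications, not the code in `data_flow_analyzer.rs`.

```rust
#[derive(Debug, PartialEq)]
pub enum TransformationType {
    Direct,
    Aggregation,
    Decomposition,
    Enrichment,
    Expansion,
}

/// Hypothetical, simplified view of a function signature.
pub struct Signature<'a> {
    pub param_count: usize,
    pub return_type: &'a str,
}

/// Applies the rules in the documented order; the first match wins.
pub fn classify(sig: &Signature<'_>) -> TransformationType {
    let ret = sig.return_type.trim();
    if sig.param_count > 1 {
        TransformationType::Aggregation
    } else if ret.starts_with("Result<") || ret.starts_with("Option<") {
        TransformationType::Enrichment
    } else if ret.starts_with("Vec<") {
        TransformationType::Expansion
    } else if ret.starts_with('(') && ret != "()" {
        // A non-unit tuple return type.
        TransformationType::Decomposition
    } else {
        TransformationType::Direct
    }
}
```

Under this ordering, a two-parameter function that returns `Result<T>` is classified as `Aggregation`, because the parameter-count rule is checked first.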
**Note:** The `DataFlowGraph.data_transformations` field (`src/data_flow/mod.rs:149`) stores `transformation_type` as a `String`, allowing flexible pattern descriptions beyond the enum variants.
### I/O Operation Detection
Tracks functions performing I/O operations for purity and performance analysis:
**I/O categories tracked:**
- **File I/O**: `std::fs`, `File::open`, `read_to_string`
- **Network I/O**: HTTP requests, socket operations
- **Database I/O**: SQL queries, ORM operations
- **System calls**: Process spawning, environment access
- **Blocking operations**: `thread::sleep`, synchronous I/O in async
**Example detection:**
```rust
// Detected I/O operations: FileWrite
fn save_config(config: &Config, path: &Path) -> Result<()> {
    let json = serde_json::to_string(config)?; // No I/O
    std::fs::write(path, json)?;               // FileWrite detected
    Ok(())
}
```
**I/O metadata:**
```json
{
  "function": "save_config",
  "io_operations": ["FileWrite"],
  "is_blocking": true,
  "affects_purity": true,
  "async_safe": false
}
```
### Purity Analysis Integration
`DataFlowGraph` integrates with purity detection to provide comprehensive side effect analysis:
**Side effect tracking:**
- I/O operations (file, network, console)
- Global state mutations
- Random number generation
- System time access
- Non-deterministic behavior
**Purity confidence factors:**
- **1.0**: Pure mathematical function, no side effects
- **0.8**: Pure with deterministic data transformations
- **0.5**: Mixed - some suspicious patterns
- **0.2**: Likely impure - I/O detected
- **0.0**: Definitely impure - multiple side effects
**Example analysis:**
```json
{
  "function": "calculate_discount",
  "is_pure": true,
  "purity_confidence": 0.95,
  "side_effects": [],
  "deterministic": true,
  "safe_to_parallelize": true,
  "safe_to_cache": true
}
```
### Modification Impact Analysis
`DataFlowGraph` calculates the impact of modifying a function:
```rust
pub struct ModificationImpact {
    pub function_name: String,
    pub affected_functions: Vec<String>, // Upstream callers
    pub dependency_count: usize,         // Downstream callees
    pub has_side_effects: bool,
    pub risk_level: RiskLevel,
}
```
**Risk level calculation:**
- **Critical**: Many upstream callers + side effects + low test coverage
- **High**: Many callers OR side effects with moderate coverage
- **Medium**: Few callers with side effects OR many callers with good coverage
- **Low**: Few callers, no side effects, or well-tested
**Example impact analysis:**
```json
{
  "function": "validate_payment_method",
  "modification_impact": {
    "affected_functions": 4,
    "dependency_count": 8,
    "has_side_effects": true,
    "risk_level": "High"
  }
}
```
**Note**: In this JSON output, `affected_functions` is reported as the count of upstream callers, whereas the `ModificationImpact` struct declares it as a `Vec<String>`. The actual function names can be obtained from the `upstream_callers` field in the function metadata.
**Using modification impact:**
```bash
# Analyze impact before refactoring
```
**Impact analysis uses:**
- **Refactoring planning**: Understand blast radius before changes
- **Test prioritization**: Focus tests on high-impact functions
- **Code review**: Flag high-risk changes for extra scrutiny
- **Dependency management**: Identify tightly coupled components
### DataFlowGraph Methods
Key methods for data flow analysis:
```rust
// Add function with its dependencies
pub fn add_function(&mut self, function_id: String, callees: Vec<String>)
// Track variable dependencies
pub fn add_variable_dependency(&mut self, function_id: String, var_name: String)
// Record I/O operations
pub fn add_io_operation(&mut self, function_id: String, io_type: IoType)
// Calculate modification impact
pub fn calculate_modification_impact(&self, function_id: &str) -> ModificationImpact
// Get all functions affected by a change
pub fn get_affected_functions(&self, function_id: &str) -> Vec<String>
// Find functions with side effects
pub fn find_functions_with_side_effects(&self) -> Vec<String>
```
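As a rough illustration of what `get_affected_functions` could compute, the sketch below walks a reverse call graph breadth-first. It assumes that indirect callers count as affected, which the signatures above do not confirm; `ToyGraph` is a hypothetical stand-in, not the real `DataFlowGraph`.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Hypothetical stand-in for a call graph: callee -> direct callers.
pub struct ToyGraph {
    pub callers: HashMap<String, Vec<String>>,
}

impl ToyGraph {
    /// Collects direct and indirect callers of `function_id`, sorted by name.
    pub fn get_affected_functions(&self, function_id: &str) -> Vec<String> {
        let mut seen: BTreeSet<String> = BTreeSet::new();
        let mut queue: VecDeque<&str> = VecDeque::new();
        queue.push_back(function_id);
        while let Some(current) = queue.pop_front() {
            for caller in self.callers.get(current).into_iter().flatten() {
                // The `seen` set also guards against cycles (recursion).
                if caller.as_str() != function_id && seen.insert(caller.clone()) {
                    queue.push_back(caller.as_str());
                }
            }
        }
        seen.into_iter().collect()
    }
}
```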
**Integration in analysis pipeline:**
1. Parser builds initial call graph
2. DataFlowGraph extends with variable/I/O tracking
3. Purity analyzer adds side effect information
4. Modification impact calculated for each function
5. Results used in prioritization and risk scoring
**Connection to Unified Scoring:**
The dependency analysis from DataFlowGraph directly feeds into the **unified scoring system's dependency factor** (20% weight):
- **Dependency Factor Calculation**: Functions with high upstream caller count or on critical paths from entry points receive higher dependency scores (8-10)
- **Isolated Utilities**: Functions with few or no callers score lower (1-3) on dependency factor
- **Impact Prioritization**: This helps prioritize functions where bugs have wider impact across the codebase
- **Modification Risk**: The modification impact analysis uses dependency data to calculate blast radius when changes are made
**Example:**
```
Function: validate_payment_method
Upstream callers: 4 (high impact)
→ Dependency Factor: 8.0
Function: format_currency_string
Upstream callers: 0 (utility)
→ Dependency Factor: 1.5
Both have same complexity, but validate_payment_method gets higher unified score
due to its critical role in the call graph.
```
This integration ensures that the unified scoring system considers not just internal function complexity and test coverage, but also the function's importance in the broader codebase architecture.
## Entropy-Based Complexity
Entropy-based analysis detects repetitive, pattern-driven code and dampens its complexity score, which reduces false positives.
**Token Classification:**
```rust
enum TokenType {
    Variable, // Weight: 1.0
    Method,   // Weight: 1.5 (more important)
    Literal,  // Weight: 0.5 (less important)
    Keyword,  // Weight: 0.8
    Operator, // Weight: 0.6
}
```
**Shannon Entropy Calculation:**
```
H(X) = -Σ p(x) × log₂(p(x))
```
where p(x) is the probability of each token type.
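The formula can be evaluated directly from token-type counts. The sketch below is a plain, unweighted Shannon entropy over token-type labels; it does not apply the per-type weights listed above, since how Debtmap combines them is not described here. `shannon_entropy` is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Shannon entropy, in bits, of a sequence of token-type labels.
/// Returns 0.0 for an empty sequence.
pub fn shannon_entropy(tokens: &[&str]) -> f64 {
    if tokens.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for token in tokens {
        *counts.entry(*token).or_insert(0) += 1;
    }
    let n = tokens.len() as f64;
    counts
        .values()
        .map(|&count| {
            let p = count as f64 / n; // p(x): relative frequency of this type
            -p * p.log2()
        })
        .sum()
}
```

A function made of a single token type has entropy 0, and one with four equally frequent types has entropy 2 bits, so lower values indicate more uniform, repetitive code.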
**Dampening Decision:**
```rust
if entropy_score.token_entropy < 0.4
    && entropy_score.pattern_repetition > 0.6
    && entropy_score.branch_similarity > 0.7
{
    // Apply dampening
    effective_complexity = base_complexity * (1.0 - dampening_factor);
}
```
**Output explanation:**
```
Function: validate_input
Cyclomatic: 15 → Effective: 5
Reasoning:
- High pattern repetition detected (85%)
- Low token entropy indicates simple patterns (0.32)
- Similar branch structures found (92% similarity)
- Complexity reduced by 67% due to pattern-based code
```
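The numbers in this output can be reproduced with the dampening rule: 15 × (1 − 0.67) = 4.95, which rounds to 5. The sketch below takes the dampening factor as an input, because how it is derived from the entropy scores is not specified here; the struct and function are hypothetical.

```rust
pub struct EntropyScore {
    pub token_entropy: f64,
    pub pattern_repetition: f64,
    pub branch_similarity: f64,
}

/// Applies dampening only when all three thresholds indicate repetitive code.
pub fn effective_complexity(base: u32, score: &EntropyScore, dampening_factor: f64) -> f64 {
    let repetitive = score.token_entropy < 0.4
        && score.pattern_repetition > 0.6
        && score.branch_similarity > 0.7;
    if repetitive {
        f64::from(base) * (1.0 - dampening_factor)
    } else {
        f64::from(base)
    }
}
```

All three conditions must hold, so a function with low entropy but dissimilar branches keeps its full cyclomatic complexity.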
## Entropy Analysis Caching
`EntropyAnalyzer` includes an LRU-style cache for performance optimization when analyzing large codebases or performing repeated analysis.
### Cache Structure
```rust
struct CacheEntry {
    score: EntropyScore,
    timestamp: Instant,
    hit_count: usize,
}
```
**Cache configuration:**
- **Default size**: 1000 entries
- **Eviction policy**: LRU (Least Recently Used)
- **Memory per entry**: ~128 bytes
- **Total memory overhead**: ~128 KB for default size
### Cache Statistics
The analyzer tracks cache performance:
```rust
pub struct CacheStats {
    pub hits: usize,
    pub misses: usize,
    pub evictions: usize,
    pub hit_rate: f64,
    pub memory_usage: usize,
}
```
**Example stats output:**
```json
{
  "entropy_cache_stats": {
    "hits": 3427,
    "misses": 1573,
    "evictions": 573,
    "hit_rate": 0.685,
    "memory_usage": 128000
  }
}
```
**Hit rate interpretation:**
- **> 0.7**: Excellent - many repeated analyses, cache is effective
- **0.4-0.7**: Good - moderate reuse, typical for incremental analysis
- **< 0.4**: Low - mostly unique functions, cache less helpful
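The example stats are consistent with a hit rate computed as hits divided by total lookups: 3427 / (3427 + 1573) ≈ 0.685. The sketch below assumes that definition and encodes the interpretation bands above; both functions are hypothetical helpers.

```rust
/// Hit rate as hits / (hits + misses); 0.0 when there were no lookups.
pub fn hit_rate(hits: usize, misses: usize) -> f64 {
    let total = hits + misses;
    if total == 0 {
        0.0
    } else {
        hits as f64 / total as f64
    }
}

/// Maps a hit rate to the interpretation bands listed above.
pub fn interpret_hit_rate(rate: f64) -> &'static str {
    if rate > 0.7 {
        "excellent"
    } else if rate >= 0.4 {
        "good"
    } else {
        "low"
    }
}
```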
### Performance Benefits
**Typical performance gains:**
- **Cold analysis**: 100ms baseline (no cache benefit)
- **Incremental analysis**: 30-40ms (~60-70% faster) for unchanged functions
- **Re-analysis**: 15-20ms (~80-85% faster) for recently analyzed functions
**Best for:**
- **Watch mode**: Analyzing on file save (repeated analysis of same files)
- **CI/CD**: Comparing feature branch to main (overlap in functions)
- **Large codebases**: Many similar functions benefit from pattern caching
**Memory estimation:**
```
Total cache memory = entry_count × 128 bytes
Examples:
- 1,000 entries: ~128 KB (default)
- 5,000 entries: ~640 KB (large projects)
- 10,000 entries: ~1.28 MB (very large)
```
### Cache Management
**Automatic eviction:**
- When the cache reaches its size limit, the least recently used entries are evicted
- Hit count influences retention (frequently accessed stay longer)
- Timestamp used for LRU ordering
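The eviction behavior can be sketched with a map and a logical clock. This is a deliberately simple illustration: it uses a counter instead of `Instant`, scans linearly for the victim, and ignores hit counts, so it is not the `EntropyAnalyzer` implementation. `LruCache` and its `u64` keys are hypothetical.

```rust
use std::collections::HashMap;

/// Minimal least-recently-used cache: key -> (score, last_used tick).
pub struct LruCache {
    capacity: usize,
    clock: u64, // logical timestamp, advanced on every access
    entries: HashMap<u64, (f64, u64)>,
    pub evictions: usize,
}

impl LruCache {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, clock: 0, entries: HashMap::new(), evictions: 0 }
    }

    /// Returns the cached score and marks the entry as recently used.
    pub fn get(&mut self, key: u64) -> Option<f64> {
        self.clock += 1;
        let now = self.clock;
        self.entries.get_mut(&key).map(|entry| {
            entry.1 = now;
            entry.0
        })
    }

    /// Inserts a score, evicting the least recently used entry when full.
    pub fn insert(&mut self, key: u64, score: f64) {
        if self.capacity == 0 {
            return;
        }
        self.clock += 1;
        if !self.entries.contains_key(&key) && self.entries.len() >= self.capacity {
            // O(n) scan for the entry with the smallest last_used tick.
            let lru_key = self.entries.iter().min_by_key(|(_, value)| value.1).map(|(k, _)| *k);
            if let Some(k) = lru_key {
                self.entries.remove(&k);
                self.evictions += 1;
            }
        }
        self.entries.insert(key, (score, self.clock));
    }
}
```

A production cache would typically replace the linear scan with an ordered structure so that eviction does not cost O(n) per insert.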
**Cache invalidation:**
- Function source changes invalidate entry
- Cache cleared between major analysis runs
- No manual invalidation needed
**Configuration (if exposed in future):**
```toml
[entropy.cache]
enabled = true
size = 1000 # Number of entries
ttl_seconds = 3600 # Optional: expire after 1 hour
```
## Context-Aware Analysis
Debtmap adjusts analysis based on code context:
**Pattern Recognition:**
- Validation patterns (repetitive checks)
- Dispatcher patterns (routing logic)
- Builder patterns (fluent APIs)
- Configuration parsers (key-value processing)
**Adjustment Strategies:**
- Reduce false positives for recognized patterns
- Apply appropriate thresholds by pattern type
- Consider pattern confidence in scoring
**Example:**
```rust
// Recognized as "validation_pattern"
// Complexity dampening applied
fn validate_user_input(input: &UserInput) -> Result<()> {
    if input.name.is_empty() { return Err(Error::EmptyName); }
    if input.email.is_empty() { return Err(Error::EmptyEmail); }
    if input.age < 13 { return Err(Error::TooYoung); }
    // ... more similar validations
    Ok(())
}
```
## Coverage Integration
Debtmap parses LCOV coverage data for risk analysis:
**LCOV Support:**
- Standard format from most coverage tools
- Line-level coverage tracking
- Function-level aggregation
**Coverage Index:**
- O(1) exact name lookups (~0.5μs)
- O(log n) line-based fallback (~5-8μs)
- ~200 bytes per function
- Thread-safe (`Arc<CoverageIndex>`)
### Performance Characteristics
**Index Build Performance:**
- Index construction: O(n), approximately 20-30ms for 5,000 functions
- Memory usage: ~200 bytes per record (~1 MB for 5,000 functions, before index overhead)
- Scales linearly with function count
**Lookup Performance:**
- Exact match (function name): O(1) average, ~0.5μs per lookup
- Line-based fallback: O(log n), ~5-8μs per lookup
- Cache-friendly data structure for hot paths
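The two lookup costs correspond to two common data structures: a hash map for exact names and an ordered map for line ranges. The sketch below assumes the fallback picks the function whose start line is closest at or before the queried line; `ToyCoverageIndex` is hypothetical and much simpler than the real `CoverageIndex`.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical two-tier index: exact name first, then nearest start line.
pub struct ToyCoverageIndex {
    by_name: HashMap<String, f64>,
    by_start_line: BTreeMap<usize, f64>,
}

impl ToyCoverageIndex {
    /// Builds both maps in one O(n) pass over (name, start_line, coverage) records.
    pub fn new(records: &[(&str, usize, f64)]) -> Self {
        let mut by_name = HashMap::new();
        let mut by_start_line = BTreeMap::new();
        for (name, start_line, coverage) in records {
            by_name.insert(name.to_string(), *coverage);
            by_start_line.insert(*start_line, *coverage);
        }
        Self { by_name, by_start_line }
    }

    /// O(1) average by name, otherwise O(log n) by line.
    pub fn lookup(&self, name: &str, line: usize) -> Option<f64> {
        if let Some(coverage) = self.by_name.get(name) {
            return Some(*coverage);
        }
        // Fallback: the function with the greatest start line <= `line`.
        self.by_start_line.range(..=line).next_back().map(|(_, coverage)| *coverage)
    }
}
```

The fallback is useful when the name in the coverage report differs from the parsed name, for example because of mangling or generics.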
**Analysis Overhead:**
- Coverage integration overhead: ~2.5x baseline analysis time
- Target overhead: ≤3x (maintained through optimizations)
- Example timing: 53ms baseline → 130ms with coverage (2.45x overhead)
- Overhead includes index build + lookups + coverage propagation
**When to use coverage integration:**
- **Skip coverage** (faster iteration): For rapid development iteration or quick local checks, omit `--lcov` to get baseline results 2.5x faster
- **Include coverage** (comprehensive analysis): Use coverage integration for final validation, sprint planning, and CI/CD gates where comprehensive risk analysis is needed
**Thread Safety:**
- Coverage index is wrapped in `Arc<CoverageIndex>`, so analyzer threads share one immutable index without locking
- Multiple analyzer threads can query coverage simultaneously
- No contention on reads, suitable for parallel analysis pipelines
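The sharing pattern looks roughly like the sketch below, which uses a plain `HashMap` in place of `CoverageIndex` and one thread per lookup for brevity. `parallel_lookup` is a hypothetical function; a real pipeline would more likely use a thread pool.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// Looks up each name on its own thread against one shared, read-only index.
pub fn parallel_lookup(
    index: Arc<HashMap<String, f64>>,
    names: Vec<String>,
) -> Vec<Option<f64>> {
    let handles: Vec<_> = names
        .into_iter()
        .map(|name| {
            // Cloning the Arc copies a pointer, not the index itself.
            let index = Arc::clone(&index);
            thread::spawn(move || index.get(&name).copied())
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```

No `Mutex` or `RwLock` is needed because the index is never mutated after construction.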
**Memory Footprint:**
```
Total memory = (function_count × 200 bytes) + index overhead
Examples:
- 1,000 functions: ~200 KB
- 5,000 functions: ~1 MB
- 10,000 functions: ~2 MB
```
**Scalability:**
- Tested with codebases up to 10,000 functions
- Performance remains predictable and acceptable
- Memory usage stays bounded and reasonable
**Generating coverage:**
```bash
# Rust (using cargo-tarpaulin)
cargo tarpaulin --out lcov --output-dir target/coverage
# Or using cargo-llvm-cov
cargo llvm-cov --lcov --output-path target/coverage/lcov.info
```
**Using with Debtmap:**
```bash
debtmap analyze . --lcov target/coverage/lcov.info
```
**Coverage dampening:**
When coverage data is provided, debt scores are dampened for well-tested code. Here `coverage_percentage` is a fraction between 0.0 and 1.0:
```
final_score = base_score × (1 - coverage_percentage)
```
As a result, well-tested complex code can receive lower priority than untested simple code.
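A short worked example shows the effect. The sketch below treats coverage as a fraction in [0.0, 1.0] and clamps out-of-range input; the scores used are made-up values, and `dampened_score` is a hypothetical helper.

```rust
/// final_score = base_score × (1 − coverage), with coverage as a fraction.
pub fn dampened_score(base_score: f64, coverage: f64) -> f64 {
    base_score * (1.0 - coverage.clamp(0.0, 1.0))
}
```

With a base score of 8.0 at 90% coverage the result is about 0.8, while a base score of 3.0 at 0% coverage stays at 3.0, so the simpler but untested function ranks higher.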
## See Also
- [Complexity Metrics](complexity-metrics.md) - Detailed metrics used in analysis
- [Risk Scoring](risk-scoring.md) - How advanced features influence risk scores
- [Interpreting Results](interpreting-results.md) - Using analysis results effectively