valknut-rs 1.0.0

High-performance Rust implementation of valknut code analysis algorithms
# Valknut Architecture Documentation

This document describes the architecture, design patterns, and implementation details of the Valknut code analysis system.

## Table of Contents
- [Overview](#overview)
- [System Architecture](#system-architecture)
- [Core Components](#core-components)
- [Analysis Pipeline](#analysis-pipeline)
- [Language Support](#language-support)
- [Data Flow](#data-flow)
- [Performance Considerations](#performance-considerations)
- [Extension Points](#extension-points)
- [Design Decisions](#design-decisions)
- [Future Considerations](#future-considerations)

## Overview

Valknut is a high-performance code analysis tool written in Rust. It provides the following analysis capabilities:

- **Structure Analysis**: Directory organization and file distribution assessment
- **Complexity Analysis**: AST-based complexity metrics using Tree-sitter parsers
- **Semantic Naming**: AI-powered function and variable name quality evaluation
- **Technical Debt Assessment**: Quantitative debt scoring and prioritization
- **Refactoring Recommendations**: Actionable improvement suggestions with impact analysis
- **Quality Gates**: CI/CD integration with configurable failure conditions

## System Architecture

### High-Level Architecture

```
┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│      CLI Layer      │    │     API Layer       │    │   Configuration     │
│   (bin/valknut.rs)  │◄──►│   (api/*.rs)        │◄──►│   (valknut.yml)     │
└─────────────────────┘    └─────────────────────┘    └─────────────────────┘
           │                         │                          │
           ▼                         ▼                          ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        Core Analysis Pipeline                               │
│                        (core/pipeline.rs)                                  │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                          Detector Modules                                  │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │  Structure  │  │ Complexity  │  │ Refactoring │  │   Names     │        │
│  │  Analysis   │  │  Analysis   │  │  Analysis   │  │  Analysis   │        │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘        │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                      Language Adapters                                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │   Python    │  │ TypeScript  │  │    Rust     │  │     Go      │        │
│  │   Parser    │  │   Parser    │  │   Parser    │  │   Parser    │        │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘        │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                           I/O Layer                                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │   Reports   │  │    Cache    │  │ Persistence │  │   Config    │        │
│  │ Generation  │  │  System     │  │   Layer     │  │  Management │        │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘        │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Core Components

### 1. Analysis Pipeline (`core/pipeline.rs`)

The central orchestrator that coordinates all analysis activities:

```rust
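use std::path::PathBuf;

pub struct AnalysisPipeline {
    config: AnalysisConfig,
    complexity_analyzer: ComplexityAnalyzer,
    structure_extractor: StructureExtractor,
    refactoring_analyzer: RefactoringAnalyzer,
}

impl AnalysisPipeline {
    /// Runs all analysis stages over `paths` and aggregates their results.
    pub async fn analyze_paths(
        &self,
        paths: &[PathBuf],
        progress_callback: Option<ProgressCallback>,
    ) -> Result<ComprehensiveAnalysisResult> {
        // Body omitted here; the stages are described under "Analysis Pipeline".
        todo!()
    }
}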

**Key Responsibilities:**
- File discovery and filtering
- Coordinating detector execution
- Progress tracking and reporting  
- Result aggregation and health metrics calculation
- Quality gate evaluation

### 2. Detector Modules (`detectors/`)

#### Structure Analyzer (`detectors/structure.rs`)
- Directory organization analysis
- File size and distribution assessment
- Reorganization recommendations

#### Complexity Analyzer (`detectors/complexity.rs`)
- AST-based complexity metrics
- Cyclomatic and cognitive complexity
- Maintainability index calculation

#### Semantic Naming Analyzer (`detectors/names/`)
- AI-powered name quality assessment
- Function and variable naming evaluation
- Renaming suggestions with context

#### Refactoring Analyzer (`detectors/refactoring.rs`)
- Code smell detection
- Improvement opportunity identification
- Impact assessment and prioritization

### 3. Language Adapters (`lang/`)

Language-specific parsers and AST analyzers using Tree-sitter:

```rust
pub trait LanguageAdapter {
    fn parse_file(&self, content: &str) -> Result<Tree>;
    fn extract_entities(&self, tree: &Tree) -> Vec<CodeEntity>;
    fn calculate_complexity(&self, node: &Node) -> ComplexityMetrics;
    fn analyze_names(&self, entities: &[CodeEntity]) -> Vec<NamingIssue>;
}
```

**Supported Languages:**
- **Python** (`lang/python.rs`) - Full support
- **TypeScript** (`lang/typescript.rs`) - Full support  
- **JavaScript** (`lang/javascript.rs`) - Full support
- **Rust** (`lang/rust_lang.rs`) - Full support
- **Go** (`lang/go.rs`) - Experimental

### 4. I/O Layer (`io/`)

#### Report Generation (`io/reports.rs`)
- Multiple output format support (JSON, HTML, Markdown, CSV)
- Template-based report generation
- Interactive dashboard creation

#### Caching System (`io/cache.rs`)
- File-based analysis result caching
- Incremental analysis support
- Cache invalidation strategies

#### Configuration Management (`core/config.rs`)
- YAML configuration parsing and validation
- Default value management
- CLI option integration

## Analysis Pipeline

### 1. File Discovery Phase

```rust
async fn discover_files(&self, paths: &[PathBuf]) -> Result<Vec<PathBuf>> {
    // 1. Traverse input paths
    // 2. Filter by file extensions
    // 3. Exclude specified directories  
    // 4. Apply file limits if configured
}
```

**Filtering Rules:**
- Include files matching configured extensions
- Exclude directories like `node_modules`, `target`, `.git`
- Respect max file limits for large codebases
- Handle both files and directories as input
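
The filtering rules above can be sketched with the standard library alone. This is a synchronous illustration under stated assumptions, not the actual `discover_files` implementation: `DiscoveryConfig` and its field names are hypothetical, and the file limit simply stops traversal once enough files have been collected.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical discovery settings; field names are illustrative.
pub struct DiscoveryConfig {
    pub extensions: Vec<String>,
    pub excluded_dirs: Vec<String>,
    pub max_files: Option<usize>,
}

/// Walk `roots` (files or directories) and collect files that pass the filters.
pub fn discover_files(roots: &[PathBuf], cfg: &DiscoveryConfig) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut stack: Vec<PathBuf> = roots.to_vec();
    while let Some(path) = stack.pop() {
        // Stop early once the configured file limit is reached.
        if cfg.max_files.map_or(false, |max| found.len() >= max) {
            break;
        }
        if path.is_dir() {
            // Skip directories such as `node_modules`, `target`, or `.git`.
            let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
            if cfg.excluded_dirs.iter().any(|d| d == name) {
                continue;
            }
            for entry in fs::read_dir(&path)? {
                stack.push(entry?.path());
            }
        } else if has_allowed_extension(&path, &cfg.extensions) {
            found.push(path);
        }
    }
    // Sort for deterministic output; directory iteration order is unspecified.
    found.sort();
    Ok(found)
}

fn has_allowed_extension(path: &Path, allowed: &[String]) -> bool {
    path.extension()
        .and_then(|e| e.to_str())
        .map_or(false, |ext| allowed.iter().any(|a| a == ext))
}
```

Because traversal order is unspecified, which files survive a `max_files` cut-off is arbitrary in this sketch; a real implementation may want a stable ordering before truncating.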

### 2. Structure Analysis Phase

```rust
async fn run_structure_analysis(&self, paths: &[PathBuf]) -> Result<StructureAnalysisResults> {
    // 1. Analyze directory organization
    // 2. Identify overcrowded directories
    // 3. Detect large files needing splitting
    // 4. Generate reorganization recommendations
}
```

**Analysis Types:**
- **Directory Pressure**: Files/LOC per directory thresholds
- **File Size Analysis**: Large file identification and splitting suggestions
- **Balance Analysis**: Code distribution assessment
- **Partitioning Recommendations**: Structural improvement suggestions
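
As an illustration of the directory pressure check, the sketch below counts files per parent directory and reports directories that exceed a threshold. The function name and the single file-count threshold are assumptions; the description above also mentions LOC-based thresholds, which are left out here.

```rust
use std::collections::BTreeMap;
use std::path::{Path, PathBuf};

/// Count files per parent directory and return the directories whose
/// file count exceeds `max_files_per_dir`, in path order.
pub fn overcrowded_dirs(files: &[PathBuf], max_files_per_dir: usize) -> Vec<(PathBuf, usize)> {
    let mut counts: BTreeMap<PathBuf, usize> = BTreeMap::new();
    for file in files {
        // Files without a parent component are grouped under the empty path.
        let dir = file.parent().unwrap_or_else(|| Path::new("")).to_path_buf();
        *counts.entry(dir).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .filter(|(_, n)| *n > max_files_per_dir)
        .collect()
}
```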

### 3. Complexity Analysis Phase

```rust
async fn run_complexity_analysis(&self, files: &[PathBuf]) -> Result<ComplexityAnalysisResults> {
    // 1. Parse files using Tree-sitter
    // 2. Extract code entities (functions, classes)
    // 3. Calculate complexity metrics
    // 4. Identify complexity hotspots
}
```

**Metrics Calculated:**
- **Cyclomatic Complexity**: Number of decision points (branches and loops) in an entity, plus one
- **Cognitive Complexity**: How difficult the control flow is for a human reader to follow
- **Technical Debt Score**: Quantified maintainability assessment
- **Maintainability Index**: Composite maintainability score
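
To make decision point counting concrete, here is a small sketch over a toy AST. It assumes one common convention (each `if` and loop adds one path, a `match` with n arms adds n - 1); the `Stmt` enum is hypothetical, and a real adapter would walk Tree-sitter nodes instead.

```rust
/// A toy AST used only for illustration.
pub enum Stmt {
    Simple,
    If { then_branch: Vec<Stmt>, else_branch: Vec<Stmt> },
    While { body: Vec<Stmt> },
    Match { arms: Vec<Vec<Stmt>> },
}

/// Count decision points in a statement list, recursing into nested blocks.
fn decision_points(stmts: &[Stmt]) -> u32 {
    stmts
        .iter()
        .map(|stmt| match stmt {
            Stmt::Simple => 0,
            Stmt::If { then_branch, else_branch } => {
                1 + decision_points(then_branch) + decision_points(else_branch)
            }
            Stmt::While { body } => 1 + decision_points(body),
            Stmt::Match { arms } => {
                // Every arm beyond the first adds one additional path.
                let extra_paths = (arms.len() as u32).saturating_sub(1);
                extra_paths + arms.iter().map(|arm| decision_points(arm)).sum::<u32>()
            }
        })
        .sum()
}

/// Cyclomatic complexity of a function body: decision points plus one.
pub fn cyclomatic_complexity(body: &[Stmt]) -> u32 {
    1 + decision_points(body)
}
```

A straight-line function scores 1; an `if` containing a `while`, followed by a three-arm `match`, scores 1 + 1 + 1 + 2 = 5.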

### 4. Semantic Analysis Phase

```rust
async fn run_semantic_analysis(&self, files: &[PathBuf]) -> Result<SemanticAnalysisResults> {
    // 1. Extract function and variable names
    // 2. Analyze naming quality using embedding models
    // 3. Detect naming inconsistencies
    // 4. Generate renaming suggestions
}
```

**AI Integration:**
- **Embedding Models**: Qwen3-Embedding-0.6B for semantic understanding
- **Mismatch Detection**: Contextual name quality assessment
- **Suggestion Generation**: Contextual renaming recommendations
- **API Protection**: Preserve public API naming conventions
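
One plausible way to turn embeddings into a mismatch signal is to compare the vector for a name with the vector for what the code does, using cosine similarity. The sketch below assumes that approach; the function names and the threshold parameter are hypothetical, and producing the embeddings themselves (for example with the model named above) is out of scope.

```rust
/// Cosine similarity of two vectors. Returns `None` when the lengths differ,
/// the vectors are empty, or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Flag a name when its embedding is less similar to the behavior embedding
/// than `threshold`. Incomparable vectors are never flagged.
pub fn is_name_mismatch(name_vec: &[f32], behavior_vec: &[f32], threshold: f32) -> bool {
    cosine_similarity(name_vec, behavior_vec).map_or(false, |sim| sim < threshold)
}
```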

### 5. Refactoring Analysis Phase

```rust
async fn run_refactoring_analysis(&self, files: &[PathBuf]) -> Result<RefactoringAnalysisResults> {
    // 1. Detect code smells and anti-patterns
    // 2. Identify improvement opportunities
    // 3. Calculate impact and effort estimates
    // 4. Prioritize recommendations
}
```

**Opportunities Identified:**
- Extract Method opportunities
- Extract Class candidates
- Reduce complexity suggestions
- Remove duplication recommendations
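
Prioritization can be illustrated by ranking recommendations on their impact-to-effort ratio. This is only one possible ordering rule, assumed for the sketch; the `Recommendation` struct and its fields are hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Recommendation {
    pub description: String,
    /// Estimated benefit; higher is better.
    pub impact: f64,
    /// Estimated cost; higher is worse.
    pub effort: f64,
}

/// Sort recommendations so the best impact-per-effort comes first.
pub fn prioritize(mut recs: Vec<Recommendation>) -> Vec<Recommendation> {
    // Guard against division by zero for effort estimates of 0.
    let ratio = |r: &Recommendation| r.impact / r.effort.max(f64::EPSILON);
    recs.sort_by(|a, b| ratio(b).total_cmp(&ratio(a)));
    recs
}
```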

### 6. Health Metrics Calculation

```rust
fn calculate_health_metrics(&self, ...) -> HealthMetrics {
    // 1. Aggregate analysis results
    // 2. Calculate composite scores
    // 3. Normalize metrics to 0-100 scale
    // 4. Weighted health score calculation
}
```

**Health Score Components:**
- **Maintainability Score** (30% weight): Based on maintainability index
- **Structure Quality Score** (30% weight): Based on structural issues
- **Complexity Score** (20% weight): Inverse of complexity metrics  
- **Technical Debt Score** (20% weight): Inverse of debt ratio
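
With the weights listed above, the overall score is a plain weighted sum. The sketch assumes each component has already been normalized to the 0-100 scale with higher meaning healthier (so the complexity and debt inputs are the inverted values); `HealthInputs` and `overall_health` are illustrative names.

```rust
/// Component scores, each expected on a 0-100 scale where higher is healthier.
pub struct HealthInputs {
    pub maintainability: f64,
    pub structure_quality: f64,
    pub complexity: f64,
    pub technical_debt: f64,
}

/// Weighted health score using the 30/30/20/20 split from this document.
pub fn overall_health(scores: &HealthInputs) -> f64 {
    // Clamp defensively so one out-of-range component cannot skew the total.
    let clamp = |v: f64| v.clamp(0.0, 100.0);
    0.30 * clamp(scores.maintainability)
        + 0.30 * clamp(scores.structure_quality)
        + 0.20 * clamp(scores.complexity)
        + 0.20 * clamp(scores.technical_debt)
}
```

For component scores of 80, 70, 60, and 50 this gives 24 + 21 + 12 + 10 = 67.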

## Language Support

### Tree-sitter Integration

Valknut uses Tree-sitter for robust, language-agnostic parsing:

```rust
use tree_sitter::{Language, Parser, Tree};

pub struct LanguageParser {
    language: Language,
    parser: Parser,
}

impl LanguageParser {
    pub fn parse(&mut self, source: &str) -> Option<Tree> {
        self.parser.parse(source, None)
    }
}
```

**Benefits:**
- **Error Recovery**: Robust parsing of incomplete/malformed code
- **Language Agnostic**: Uniform AST structure across languages
- **Performance**: Fast parsing with minimal memory usage
- **Incremental**: Support for incremental parsing (future)

### Language-Specific Implementations

#### Python Support (`lang/python.rs`)
- Function, class, and method extraction
- Import and module analysis
- Python-specific complexity patterns
- Django/Flask framework awareness

#### TypeScript/JavaScript Support (`lang/typescript.rs`, `lang/javascript.rs`)
- Modern syntax support (ES2024, TypeScript 5.x)
- React component analysis
- Module system understanding
- Node.js and browser pattern recognition

#### Rust Support (`lang/rust_lang.rs`)
- Trait and implementation analysis
- Ownership and borrowing pattern recognition
- Cargo project structure awareness
- Async/await pattern analysis

## Data Flow

### Input Processing
```
CLI Args → Configuration → Path Discovery → File Filtering
```

### Analysis Flow
```
Files → Language Detection → AST Parsing → Entity Extraction → 
Metric Calculation → Issue Detection → Recommendation Generation
```

### Output Generation
```
Analysis Results → Format Selection → Template Processing → 
Report Generation → File Output
```

### Caching Flow
```
Input Hash → Cache Check → [Cache Hit: Return | Cache Miss: Analyze → Cache Store]
```
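
The hit/miss flow can be sketched as a get-or-compute lookup. To stay dependency-free, the key below is derived with the standard library's `DefaultHasher` as a stand-in for the SHA-256 digest named under Caching Strategy; `DefaultHasher` output is not guaranteed stable across Rust releases, so it would not suit an on-disk cache. `AnalysisCache` and its in-memory map are assumptions for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in cache key over file content and configuration.
pub fn cache_key(content: &str, config: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    config.hash(&mut hasher);
    hasher.finish()
}

/// In-memory cache of analysis output, keyed by `cache_key`.
pub struct AnalysisCache {
    entries: HashMap<u64, String>,
    /// Number of lookups that had to run the analysis.
    pub misses: usize,
}

impl AnalysisCache {
    pub fn new() -> Self {
        AnalysisCache { entries: HashMap::new(), misses: 0 }
    }

    /// Return the cached result on a hit; otherwise analyze and store.
    pub fn get_or_analyze<F>(&mut self, content: &str, config: &str, analyze: F) -> String
    where
        F: FnOnce() -> String,
    {
        let key = cache_key(content, config);
        if let Some(hit) = self.entries.get(&key) {
            return hit.clone();
        }
        self.misses += 1;
        let result = analyze();
        self.entries.insert(key, result.clone());
        result
    }
}
```

Including the configuration in the key means a changed threshold or rule set forces a fresh analysis even when the file content is unchanged.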

## Performance Considerations

### Rust Performance Features

1. **Zero-Cost Abstractions**: Traits and generics compile to efficient code
2. **Memory Safety**: Enforced at compile time, with no garbage collection overhead
3. **SIMD Optimizations**: Vectorized operations for large datasets
4. **Parallel Processing**: Rayon for data parallelism

### Optimization Strategies

#### Parallel Analysis
```rust
use rayon::prelude::*;

files.par_iter()
    .map(|file| analyze_file(file))
    .collect::<Result<Vec<_>, _>>()
```

#### Memory Efficiency
```rust
// Stream processing for large files
use std::io::{BufRead, BufReader};

let reader = BufReader::new(file);
for line in reader.lines() {
    let line = line?; // each item is an io::Result<String>
    // Process line by line
}
```

#### Caching Strategy
- **Content-based hashing**: SHA-256 of file content + config
- **Hierarchical caching**: File, directory, and project level
- **TTL-based expiration**: Configurable cache lifetime
- **Selective invalidation**: Only invalidate affected cache entries
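
TTL-based expiration can be sketched by storing a timestamp next to each cached value and comparing it with the configured lifetime on read. `CacheEntry` and its methods are hypothetical; passing `now` explicitly is a choice made here so the behavior is easy to test.

```rust
use std::time::{Duration, Instant};

/// A cached value together with the time it was stored.
pub struct CacheEntry<T> {
    value: T,
    stored_at: Instant,
}

impl<T> CacheEntry<T> {
    pub fn new(value: T, stored_at: Instant) -> Self {
        CacheEntry { value, stored_at }
    }

    /// Return the value while it is younger than `ttl`, otherwise `None`.
    pub fn get(&self, now: Instant, ttl: Duration) -> Option<&T> {
        // A `now` earlier than `stored_at` saturates to an age of zero.
        if now.saturating_duration_since(self.stored_at) < ttl {
            Some(&self.value)
        } else {
            None
        }
    }
}
```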

## Extension Points

### Adding New Languages

1. **Language Definition**:
```rust
pub struct NewLanguageAdapter {
    parser: LanguageParser,
    config: LanguageConfig,
}

impl LanguageAdapter for NewLanguageAdapter {
    // Implement required methods
}
```

2. **Register Language**:
```rust
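use std::collections::HashMap;

pub fn register_languages() -> HashMap<String, Box<dyn LanguageAdapter>> {
    // Annotate the map so each boxed adapter coerces to the trait object type.
    let mut languages: HashMap<String, Box<dyn LanguageAdapter>> = HashMap::new();
    languages.insert("newlang".to_string(), Box::new(NewLanguageAdapter::new()));
    languages
}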
```

### Adding New Detectors

1. **Detector Implementation**:
```rust
pub struct CustomDetector {
    config: CustomConfig,
}

impl Detector for CustomDetector {
    async fn analyze(&self, entities: &[CodeEntity]) -> Result<DetectorResult> {
        // Custom analysis logic
    }
}
```

2. **Pipeline Integration**:
```rust
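impl AnalysisPipeline {
    async fn run_custom_analysis(&self, files: &[PathBuf]) -> Result<DetectorResult> {
        // `extract_entities` is a placeholder for the parsing step that turns
        // files into `CodeEntity` values; it is not defined in this document.
        let entities = self.extract_entities(files).await?;
        let detector = CustomDetector::new(self.config.custom.clone());
        detector.analyze(&entities).await
    }
}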
```

### Adding New Output Formats

1. **Format Implementation**:
```rust
pub struct CustomFormatter;

impl ReportFormatter for CustomFormatter {
    fn format(&self, results: &ComprehensiveAnalysisResult) -> Result<String> {
        // Custom formatting logic
    }
}
```

2. **Registration**:
```rust
pub fn get_formatter(format: &OutputFormat) -> Box<dyn ReportFormatter> {
    match format {
        OutputFormat::Custom => Box::new(CustomFormatter),
        // ... other formats
    }
}
```

## Design Decisions

### ADR-001: Rust Implementation Choice

**Status**: Accepted

**Context**: Need for high-performance code analysis tool capable of handling large codebases.

**Decision**: Implement in Rust for performance, memory safety, and ecosystem benefits.

**Consequences**:
- ✅ Excellent performance and memory efficiency
- ✅ Memory safety without garbage collection overhead
- ✅ Rich ecosystem of parsing and analysis crates
- ❌ Steeper learning curve for contributors
- ❌ Longer compilation times during development

### ADR-002: Tree-sitter for Parsing

**Status**: Accepted

**Context**: Need robust, language-agnostic parsing for multiple programming languages.

**Decision**: Use Tree-sitter for all language parsing.

**Consequences**:
- ✅ Uniform AST structure across languages
- ✅ Error-tolerant parsing of incomplete code
- ✅ High performance, with incremental parsing available for future use
- ✅ Large ecosystem of language grammars
- ❌ Additional dependency on Tree-sitter binaries
- ❌ Learning curve for Tree-sitter query syntax

### ADR-003: Multi-Stage Analysis Pipeline

**Status**: Accepted

**Context**: Need to coordinate multiple types of analysis with progress tracking.

**Decision**: Implement pipeline pattern with distinct analysis stages.

**Consequences**:
- ✅ Clear separation of concerns
- ✅ Easy to add new analysis types
- ✅ Progress tracking and error isolation
- ✅ Configurable analysis stages
- ❌ Potential for redundant file processing
- ❌ More complex coordination logic

### ADR-004: Configuration-First Design

**Status**: Accepted

**Context**: Need flexible configuration for different project types and team standards.

**Decision**: Make all analysis behavior configurable through YAML configuration.

**Consequences**:
- ✅ Flexible adaptation to different codebases
- ✅ Team-specific standard enforcement
- ✅ CLI and config file integration
- ✅ Validation and documentation
- ❌ Configuration complexity for users
- ❌ Default configuration maintenance

### ADR-005: Quality Gates for CI/CD

**Status**: Accepted

**Context**: Need automated quality control integration with development workflows.

**Decision**: Implement configurable quality gates with exit codes for CI/CD.

**Consequences**:
- ✅ Automated quality enforcement
- ✅ Configurable failure conditions
- ✅ Integration with existing CI/CD systems
- ✅ Gradual quality improvement enforcement
- ❌ Potential for overly strict gates blocking development
- ❌ Configuration complexity for optimal thresholds
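
A minimal sketch of how a gate might map analysis results to a CI exit code. The two thresholds, the struct names, and the choice of exit code 1 for any violation are assumptions made for illustration, not Valknut's documented behavior.

```rust
/// Hypothetical gate thresholds.
pub struct QualityGate {
    pub min_health_score: f64,
    pub max_cyclomatic_complexity: u32,
}

/// Hypothetical summary of one analysis run.
pub struct AnalysisSummary {
    pub health_score: f64,
    pub max_cyclomatic_complexity: u32,
}

/// Collect a human-readable message for every violated condition.
pub fn gate_violations(gate: &QualityGate, summary: &AnalysisSummary) -> Vec<String> {
    let mut violations = Vec::new();
    if summary.health_score < gate.min_health_score {
        violations.push(format!(
            "health score {:.1} is below minimum {:.1}",
            summary.health_score, gate.min_health_score
        ));
    }
    if summary.max_cyclomatic_complexity > gate.max_cyclomatic_complexity {
        violations.push(format!(
            "cyclomatic complexity {} exceeds maximum {}",
            summary.max_cyclomatic_complexity, gate.max_cyclomatic_complexity
        ));
    }
    violations
}

/// Process exit code for CI: 0 when all gates pass, 1 otherwise.
pub fn exit_code(violations: &[String]) -> i32 {
    if violations.is_empty() { 0 } else { 1 }
}
```

Reporting every violation, rather than stopping at the first, gives a CI log that shows the full distance to a passing build.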

## Future Considerations

### Planned Enhancements

1. **Incremental Analysis**: Only analyze changed files
2. **Language Server Protocol**: IDE integration for real-time analysis
3. **Machine Learning**: Improved pattern recognition and recommendations
4. **Distributed Analysis**: Support for analyzing very large codebases
5. **Custom Rules**: User-defined analysis rules and patterns
6. **Integration APIs**: REST API for third-party tool integration

### Scalability Plans

1. **Database Backend**: Store analysis results in database for large projects
2. **Caching Strategy**: Distributed caching for team environments
3. **Parallel Execution**: Multi-machine analysis coordination
4. **Memory Management**: Streaming analysis for memory-constrained environments

This architecture provides a solid foundation for comprehensive code analysis while maintaining flexibility for future enhancements and extensions.