deepwiki-rs 1.5.0

deepwiki-rs(also known as Litho) is a high-performance automatic generation engine for C4 architecture documentation, developed using Rust. It can intelligently analyze project structures, identify core components, parse dependency relationships, and leverage large language models (LLMs) to automatically generate professional architecture documentation.
**Technical Documentation: Preprocessing Domain**

**Version:** 1.0  
**System:** deepwiki-rs  
**Classification:** Core Business Domain  
**Last Updated:** 2026-02-01 06:42:41 (UTC)

---

## 1. Overview

The **Preprocessing Domain** constitutes the foundational stage of the deepwiki-rs documentation generation pipeline. As the first executable phase in the four-stage workflow (Preprocessing → Research → Composition → Output), this domain is responsible for transforming raw source code repositories into structured, machine-readable analytical artifacts.

The domain implements a hybrid static-and-AI analysis strategy to extract project topology, code dependencies, interface definitions, and architectural relationships from 12+ programming languages. Its primary output—`Vec<CodeInsight>`—serves as the critical data substrate for downstream Research Domain agents performing C4-level architectural analysis.

**Key Characteristics:**
- **Language Agnostic**: Trait-based abstraction supporting Rust, Java, Python, JavaScript, TypeScript, C#, PHP, Kotlin, Swift, React, Vue, and Svelte
- **Hybrid Analysis**: Combines high-performance static regex parsing with LLM-enhanced semantic analysis
- **Parallel Execution**: Controlled concurrency using Tokio async runtime with semaphore-based resource limiting
- **Stateful Pipeline**: Persists results to scoped memory (`MemoryScope::PREPROCESS`) for cross-domain data transfer

---

## 2. Architectural Position

Within the system's Domain-Driven Design (DDD) architecture, the Preprocessing Domain resides in the **Core Business Domain** layer. It maintains strict upstream dependencies on Infrastructure Domains (Configuration, LLM Integration, Caching) and downstream data contracts with the Research Domain.

```mermaid
graph LR
    A[Configuration<br/>Management] -->|initializes| B[Core Generation<br/>Domain]
    B -->|orchestrates| C[Preprocessing<br/>Domain]
    C -->|produces| D[MemoryScope::<br/>PREPROCESS]
    D -->|consumes| E[Research<br/>Domain]
    
    F[LLM Integration] -.->|enhances| C
    G[Caching Domain] -.->|optimizes| C
    
    style C fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    style D fill:#fff3e0,stroke:#e65100
```

**Execution Pattern:**  
Unlike the Research and Composition domains that utilize the declarative `StepForwardAgent` trait framework, Preprocessing employs an **imperative execution pattern**. Agents in this domain directly invoke the `AgentExecutor` for fine-grained control over data transformation tasks that require complex filtering, parallelization, and staged persistence logic.

---

## 3. Core Components

### 3.1 PreProcessAgent (Orchestrator)
The central coordinator implementing the `Generator<PreprocessingResult>` trait. It executes a deterministic six-step workflow with structured progress logging and performance instrumentation.

**Key Responsibilities:**
- Pipeline sequencing and error boundary management
- Resource allocation (parallelism limits via `config.llm.max_parallels`)
- Cross-component data marshalling
- Final result aggregation and memory persistence

### 3.2 StructureExtractor
Performs recursive directory traversal with intelligent file filtering. Implements the `BoxFuture` pattern for async directory walking while respecting `.gitignore` semantics and exclusion patterns.

**Capabilities:**
- Importance scoring algorithm (multi-factor heuristic based on path depth, naming conventions, file size, and extension type)
- Hierarchical project structure serialization
- Integration with `LanguageProcessorManager` for early-stage language detection

### 3.3 Language Processing Subsystem
A trait-based strategy pattern implementation providing unified analysis across heterogeneous codebases.

**Architecture:**
- **Trait Definition**: `LanguageProcessor` with methods for `extract_dependencies()`, `extract_interfaces()`, `determine_component_type()`, and `complexity_metrics()`
- **Facade Pattern**: `LanguageProcessorManager` routes files to appropriate concrete processors based on extension matching
- **Concrete Implementations**: 12+ specialized processors utilizing pre-compiled regex patterns for efficient static analysis (e.g., `^\\s*import\\s+([^;]+);` for Java import extraction)

### 3.4 Analysis Agents

**CodePurposeEnhancer**  
Hybrid classification agent combining rule-based heuristics with AI fallback. Employs confidence thresholding (0.7) to determine whether to use LLM classification or deterministic pattern matching for file purpose identification.

**CodeAnalyze Agent**  
Implements two-phase analysis:
1. **Static Phase**: Regex-based extraction of imports, exports, function signatures, and complexity metrics
2. **AI Phase**: LLM enhancement using `extract::<CodeInsight>()` for semantic responsibilities and architectural role classification

**RelationshipsAnalyze Agent**  
Project-level architectural analysis component performing:
- Importance-filtered insight aggregation (retains scores ≥ 0.6, top 150 insights)
- Prompt compression via `PromptCompressor` to manage LLM token constraints
- Dependency graph generation with circular dependency detection

---

## 4. Processing Workflow

The Preprocessing Domain executes a strictly sequential six-step pipeline:

```mermaid
flowchart TD
    Start([Start]) --> S1[1. Original Document Extraction<br/>README.md, CONTRIBUTING.md]
    S1 --> S2[2. Structure Extraction<br/>Recursive Directory Traversal]
    S2 --> S3[3. Core File Identification<br/>Importance Scoring > 0.5]
    S3 --> S4[4. AI Code Analysis<br/>Parallel Two-Phase Analysis]
    S4 --> S5[5. Relationship Analysis<br/>Dependency Graph Generation]
    S5 --> S6[6. Memory Persistence<br/>PREPROCESSING Scope Storage]
    S6 --> End([End])
    
    S2 -->|uses| StructureExtractor
    S3 -->|uses| CodePurposeEnhancer
    S4 -->|uses| CodeAnalyze
    S4 -->|dispatches| LanguageProcessors
    S5 -->|uses| RelationshipsAnalyze
    
    style S4 fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style S5 fill:#e1f5fe,stroke:#01579b,stroke-width:2px
```

### Step 1: Original Document Extraction
Extracts high-level project metadata from `README.md`, `CONTRIBUTING.md`, and other documentation files using `OriginalDocumentExtractor`. Preserves original structure for later composition stages.

### Step 2: Structure Extraction
Recursively traverses the project directory, calculating importance scores for each file. Files scoring above 0.5 are flagged as `is_core` and prioritized for deep analysis. Filters exclude binary files, test directories, and hidden configuration files.

### Step 3: Core File Identification
Enhances file metadata with functional classification (e.g., Controller, Service, Model, Utility) using the `CodePurposeEnhancer`. Rule-based classification executes first; ambiguous cases escalate to LLM analysis.

### Step 4: AI Code Analysis (Parallel Execution)
The most computationally intensive phase. Executes `do_parallel_with_limit()` to process core files concurrently within configured resource constraints (`max_parallels`).

For each file:
- Language-specific processor extracts static dependencies and interfaces
- `AgentExecutor::extract::<CodeInsight>()` generates semantic analysis including:
  - Functional responsibilities
  - Architectural patterns detected
  - Complexity metrics (cyclomatic complexity, LOC)
  - Documentation comments (Javadoc, JSDoc, RustDoc)

### Step 5: Relationship Analysis
Aggregates individual file insights into project-level architectural views. Implements token management through:
- Importance-based filtering (excludes low-relevance files from context)
- Content compression strategies for large codebases
- Structured extraction of `RelationshipAnalysis` (dependency graphs, module boundaries)

### Step 6: Memory Persistence
Persists four primary data structures to `MemoryScope::PREPROCESS`:
- `PROJECT_STRUCTURE`: Hierarchical file tree with metadata
- `CODE_INSIGHTS`: Vector of analyzed code dossiers
- `RELATIONSHIPS`: Dependency and architectural relationship graphs
- `ORIGINAL_DOCUMENT`: Raw documentation extraction

---

## 5. Technical Implementation Details

### 5.1 Concurrency Model
Implements controlled parallelism to prevent resource exhaustion during LLM API calls:

```rust
// Pattern from implementation
do_parallel_with_limit(
    codes_to_analyze,
    context.config.llm.max_parallels,
    |code| async move {
        // Analysis logic with cloned context
    }
).await
```

**Thread Safety:**  
Components utilize `Arc<RwLock<T>>` for shared state access. `LanguageProcessorManager` implements `Clone` by recreating processor instances, ensuring thread-safe dispatch without cross-contamination of parser state.

### 5.2 Prompt Compression Strategy
To accommodate LLM context window limitations during relationship analysis:

1. **Filtering**: Retains only insights with `importance_score >= 0.6`
2. **Truncation**: Limits to top 150 insights and 20 dependencies per file
3. **Semantic Compression**: `PromptCompressor` with configurable `CompressionConfig` reduces token count while preserving architectural significance

### 5.3 Importance Scoring Algorithm
Multi-factor heuristic determining analysis priority:

```
Importance = f(location, naming, size, extension, content_type)

Factors:
- Location: src/, lib/ paths weighted higher (0.3)
- Naming: main.*, index.*, mod.rs receive bonus (0.25)
- Size: Optimal range 1KB-50KB (diminishing returns outside range)
- Extension: Core languages weighted 0.3 vs. configuration files
- Database: SQL-related paths flagged for conditional analysis
```

Files exceeding threshold 0.5 receive full AI analysis; others receive static analysis only.

### 5.4 Data Contracts

**PreprocessingResult** (Output DTO):
```rust
pub struct PreprocessingResult {
    pub original_document: OriginalDocument,
    pub project_structure: ProjectStructure,
    pub code_insights: Vec<CodeInsight>,
    pub relationships: RelationshipAnalysis,
}
```

**CodeInsight** (Core Artifact):
Aggregates static analysis (`CodeDossier`) with AI-generated semantic understanding:
- Interfaces (public APIs, exports)
- Dependencies (imports, external references)
- Complexity metrics (cyclomatic, cognitive)
- Responsibilities (natural language description of purpose)
- Component type classification (Domain Service, Repository, Controller, etc.)

---

## 6. Integration Interfaces

### 6.1 Upstream Dependencies
The domain receives execution context via the `Generator` trait:

```rust
pub trait Generator<T> {
    async fn execute(&self, context: GeneratorContext) -> Result<T>;
}
```

**GeneratorContext Provides:**
- `config`: Execution parameters (parallelism limits, exclusion patterns)
- `llm_client`: Arc-wrapped LLM client for AI operations
- `cache_manager`: Response caching for structure extraction
- `memory`: Scoped storage for result persistence

### 6.2 Downstream Consumption
Research Domain agents retrieve preprocessing results via memory scope access:

```rust
// Pattern used by Research agents
let insights: Vec<CodeInsight> = context
    .memory
    .get_scoped(MemoryScope::PREPROCESS, ScopedKeys::CODE_INSIGHTS)
    .await?;
```

**Contract Stability:**  
The `CodeInsight` schema serves as the stable interface between Preprocessing and Research domains. Changes to this structure require synchronized updates across both domains.

---

## 7. Configuration Parameters

Key configuration values affecting Preprocessing behavior:

| Parameter | Domain | Description | Default |
|-----------|--------|-------------|---------|
| `max_parallels` | LLM | Concurrent file analysis limit | 5 |
| `max_depth` | Preprocessing | Maximum directory traversal depth | Unlimited |
| `excluded_dirs` | Preprocessing | Patterns to exclude (e.g., `tests/`, `target/`) | `[".git", "node_modules"]` |
| `importance_threshold` | Preprocessing | Minimum score for core file designation | 0.5 |
| `ai_confidence_threshold` | Preprocessing | Minimum confidence for AI classification | 0.7 |

---

## 8. Error Handling and Resilience

**Fail-Fast Strategy:**  
Critical errors in code analysis (e.g., LLM API failures, parsing panics) propagate immediately via `anyhow::Result` to halt the pipeline, preventing corrupted state from reaching Research agents.

**Graceful Degradation:**
- Language processors fall back to generic text analysis for unsupported file types
- AI classification falls back to rule-based heuristics when confidence is insufficient
- Individual file analysis failures do not cascade (logged and skipped)

**Resource Protection:**
- Semaphore-based concurrency prevents LLM rate limit violations
- Token estimation prevents prompt overflow before LLM invocation
- Timeout handling on file system operations prevents hanging on deep directory structures

---

## 9. Performance Considerations

**Optimization Strategies:**
1. **Caching**: Structure extraction results cached via MD5 hash of directory state
2. **Lazy Evaluation**: Language processors only parse files flagged as core (importance > 0.5)
3. **Parallel I/O**: Async file operations throughout; blocking syscalls offloaded to `tokio::task::spawn_blocking` where necessary
4. **Memory Efficiency**: Streaming JSON serialization for large `CodeInsight` vectors; prompt compression reduces LLM token costs by 40-60%

**Bottlenecks:**
- LLM API latency dominates execution time (mitigated by parallelization)
- Deep recursive directories with thousands of files impact memory usage (mitigated by importance filtering)
- Regex compilation overhead in language processors (mitigated by lazy static initialization)

---

## 10. Extension Points

**Adding Language Support:**
1. Implement `LanguageProcessor` trait for new language
2. Register processor in `LanguageProcessorManager` extension map
3. Define file extension mappings and regex patterns for import/interface extraction

**Custom Analysis Agents:**
New analysis stages can be inserted between Step 4 and Step 5 by:
1. Implementing agent struct with `execute(context, inputs)` interface
2. Adding invocation in `PreProcessAgent::execute` workflow
3. Defining new `ScopedKeys` constant for memory persistence

---

**End of Document**