deepwiki-rs 1.3.0

# Core Workflows

## 1. Workflow Overview

### 1.1 System Architecture and Workflow Philosophy

The **deepwiki-rs** system implements a sophisticated multi-agent AI pipeline for automated software documentation generation. The workflow architecture follows a **staged pipeline pattern** with clear separation of concerns across four primary execution phases: **Preprocessing**, **Research**, **Composition**, and **Output**. 

The system employs a **C4 Model abstraction hierarchy** to ensure architectural analysis progresses from high-level system context (C1) through containers and components (C2) to detailed code-level analysis (C3-C4). This hierarchical approach ensures that detailed technical analysis is always informed by broader architectural context.

### 1.2 Core Execution Paths

```mermaid
flowchart LR
    subgraph Input["Input Layer"]
        A[CLI Arguments] --> B[Configuration]
        C[Target Project] --> D[Knowledge Base]
    end
    
    subgraph Processing["Processing Layer"]
        E[Preprocessing] --> F[Research]
        F --> G[Composition]
    end
    
    subgraph Output["Output Layer"]
        G --> H[Document Persistence]
        H --> I[Summary Reports]
    end
    
    Input --> Processing
    Processing --> Output
    
    style Processing fill:#e3f2fd,stroke:#1976d2
    style Input fill:#e8f5e9,stroke:#388e3c
    style Output fill:#fff3e0,stroke:#f57c00
```

### 1.3 Key Process Nodes

| Process Node | Domain | Responsibility | Output |
|-------------|--------|---------------|---------|
| **CLI Entry** | Configuration Management | Argument parsing, config hierarchy resolution | Config, GeneratorContext |
| **PreProcessAgent** | Preprocessing Domain | 6-step analysis pipeline | CodeInsights, ProjectStructure |
| **ResearchOrchestrator** | Research Domain | 8-agent architectural analysis | Research Reports (C1-C4) |
| **DocumentationComposer** | Composition Domain | 6-editor documentation generation | Markdown Sections |
| **DiskOutlet** | Output Domain | Persistence and post-processing | Documentation Artifacts |

### 1.4 Process Coordination Mechanisms

The system utilizes three primary coordination mechanisms:

1. **Hierarchical Memory Scopes**: Typed storage across `PREPROCESSING`, `STUDIES_RESEARCH`, and `DOCUMENTATION` scopes enables stateful inter-agent communication
2. **Dependency-Aware Orchestration**: Agents execute following a directed acyclic graph (DAG) of data dependencies
3. **Resource-Constrained Parallelism**: CPU-bound and IO-bound operations utilize `do_parallel_with_limit` to respect LLM rate limits and system resources

---

## 2. Main Workflows

### 2.1 End-to-End Documentation Generation Workflow

This is the primary business process orchestrating the complete lifecycle from project ingestion to documentation delivery.

#### 2.1.1 Process Flow

```mermaid
flowchart TD
    Start([CLI Invocation]) --> Init[Initialize GeneratorContext<br/>LLM Client, Cache, Memory]
    Init --> KnowledgeSync{External Knowledge<br/>Configured?}
    KnowledgeSync -->|Yes| Sync[KnowledgeSyncer<br/>Document Ingestion & Chunking]
    KnowledgeSync -->|No| Preprocess
    Sync --> Preprocess[PreProcessAgent<br/>6-Step Analysis]
    
    Preprocess --> Research[ResearchOrchestrator<br/>Multi-Agent Pipeline]
    Research --> Compose[DocumentationComposer<br/>Editor Pipeline]
    Compose --> Output[Outlet Layer<br/>Persistence & Fixing]
    Output --> Summary[SummaryOutlet<br/>Performance Reports]
    Summary --> End([Documentation Artifacts])
    
    subgraph PreprocessStage["Stage 1: Preprocessing"]
        direction TB
        P1[README Extraction] --> P2[Structure Analysis]
        P2 --> P3[Core File Identification]
        P3 --> P4[Language Processing]
        P4 --> P5[AI Code Analysis]
        P5 --> P6[Relationship Analysis]
    end
    
    subgraph ResearchStage["Stage 2: Research (C1-C4)"]
        direction TB
        R1[SystemContextResearcher] --> R2[DomainModulesDetector]
        R2 --> R3[Parallel Specialized Agents]
        R3 --> R3a[ArchitectureResearcher]
        R3 --> R3b[WorkflowResearcher]
        R3 --> R3c[KeyModulesInsight]
        R3 --> R3d[BoundaryAnalyzer]
        R3 --> R3e[DatabaseOverviewAnalyzer]
    end
    
    Preprocess --> PreprocessStage
    ResearchStage --> Research
    
    style Init fill:#e1f5fe
    style End fill:#e8f5e9
    style PreprocessStage fill:#fff3e0
    style ResearchStage fill:#fce4ec
```

#### 2.1.2 Detailed Process Steps

**Step 1: Context Initialization**
- **Input**: CLI arguments, configuration files (`litho.toml`), environment variables
- **Process**: 
  - `Config` resolution with multi-format project name inference (Cargo.toml, package.json, pom.xml, .csproj)
  - `GeneratorContext` construction aggregating LLM client, CacheManager, Memory system
  - Token estimator and performance monitor initialization
- **Output**: Initialized `GeneratorContext` with `Arc<RwLock<T>>` protected shared state

**Step 2: Knowledge Synchronization**
- **Trigger**: Conditional execution based on `knowledge_base` configuration
- **Process**:
  - Change detection via file mtime comparison and HashSet symmetric difference
  - Multi-format document processing (PDF, Markdown, SQL, YAML, JSON)
  - Intelligent chunking: Semantic (Markdown/SQL-aware), Paragraph-based, or Fixed-size with overlap
  - Category organization (architecture, database, API, ADR)
- **Output**: Cached, chunked documents available for RAG-style retrieval

**Step 3: Preprocessing Pipeline**
- **Process**: Six sequential sub-steps:
  1. **Original Document Extraction**: README.md discovery and content extraction
  2. **Structure Extraction**: Recursive directory traversal with intelligent filtering (excludes build artifacts, node_modules, .git)
  3. **Core File Identification**: Importance scoring algorithm identifying "core" code files vs. boilerplate
  4. **Language Processing**: Extension-based dispatch to 12+ `LanguageProcessor` implementations
  5. **AI Code Analysis**: Two-phase analysis (static extraction → AI enhancement) with controlled parallelism
  6. **Relationship Analysis**: Project-level architectural dependency analysis with prompt compression
- **Output**: `PreprocessingResult` containing `Vec<CodeInsight>`, `ProjectStructure`, `RelationshipAnalysis`

**Step 4: Multi-Agent Research**
- **Process**: Three-layer C4 analysis pipeline (detailed in section 2.2)
- **Key Characteristic**: Staged execution ensuring C1 context informs C2 analysis, which informs C3-C4 deep dives

**Step 5: Documentation Composition**
- **Process**: Sequential execution of specialized editors:
  1. `OverviewEditor`: C4 System Context documentation
  2. `ArchitectureEditor`: C4 Container/Component/Code views with Mermaid diagrams
  3. `WorkflowEditor`: Process documentation (the output of this current analysis)
  4. `KeyModulesInsightEditor`: Concurrent module deep-dives
  5. `BoundaryEditor`: CLI/API/Router interface documentation
  6. `DatabaseEditor`: Conditional SQL schema documentation (executes only if `.sql` files detected)
- **Output**: Populated `DocTree` with file paths mapped to generated content

**Step 6: Output and Persistence**
- **Process**:
  - Directory structure creation with internationalized naming (`4.Deep-Exploration` vs `4、深入探索`)
  - Content retrieval from `MemoryScope::DOCUMENTATION`
  - Mermaid diagram syntax fixing via external `mermaid-fixer` tool
  - Summary report generation (Full and Brief modes)
- **Output**: Markdown files on disk, performance metrics report

### 2.2 Multi-Agent Research Pipeline

This workflow implements the core intellectual property of the system—AI-powered architectural analysis following C4 model abstraction levels.

#### 2.2.1 C4 Abstraction Levels

```mermaid
flowchart TD
    subgraph C1["C1: System Context"]
        A[SystemContextResearcher]
        A --> |Produces| A1[SystemContextReport<br/>Business Value, Users, External Systems]
    end
    
    subgraph C2["C2: Domain & Architecture"]
        B[DomainModulesDetector] --> |Produces| B1[DomainModulesReport<br/>DDD Bounded Contexts]
        B --> C[ArchitectureResearcher]
        B --> D[WorkflowResearcher]
        C --> |Produces| C1[ArchitectureReport<br/>Containers, Components]
        D --> |Produces| D1[WorkflowReport<br/>Business Flows]
    end
    
    subgraph C3C4["C3-C4: Component & Code"]
        E[KeyModulesInsight] --> |Produces| E1[KeyModuleReport<br/>Technical Deep Dive]
        F[BoundaryAnalyzer] --> |Produces| F1[BoundaryAnalysisReport<br/>Interfaces]
        G[DatabaseOverviewAnalyzer] --> |Produces| G1[DatabaseOverviewReport<br/>Schema]
    end
    
    A --> B
    C2 --> C3C4
    
    style C1 fill:#e3f2fd
    style C2 fill:#e8f5e9
    style C3C4 fill:#fff3e0
```

#### 2.2.2 Agent Dependencies and Data Flow

```mermaid
sequenceDiagram
    participant RO as ResearchOrchestrator
    participant SC as SystemContextResearcher
    participant DM as DomainModulesDetector
    participant AR as ArchitectureResearcher
    participant WR as WorkflowResearcher
    participant KM as KeyModulesInsight
    participant BA as BoundaryAnalyzer
    participant DA as DatabaseOverviewAnalyzer
    participant Mem as Memory Store
    
    RO->>SC: execute()
    SC->>Mem: store SystemContextReport
    RO->>DM: execute()
    Note over DM: Requires SystemContextReport
    DM->>Mem: store DomainModulesReport
    
    par Parallel C2 Analysis
        RO->>AR: execute()
        Note over AR: Requires DomainModulesReport
        AR->>Mem: store ArchitectureReport
    and
        RO->>WR: execute()
        Note over WR: Requires DomainModulesReport
        WR->>Mem: store WorkflowReport
    end
    
    par Parallel C3-C4 Analysis
        RO->>KM: execute()
        Note over KM: Requires DomainModulesReport<br/>Parallel per domain
        KM->>Mem: store Vec<KeyModuleReport>
    and
        RO->>BA: execute()
        Note over BA: Requires DomainModulesReport<br/>Filters CODE_INSIGHTS
        BA->>Mem: store BoundaryAnalysisReport
    and
        RO->>DA: execute()
        Note over DA: Conditional: SQL files exist
        DA->>Mem: store DatabaseOverviewReport
    end
```

#### 2.2.3 Agent Specializations

| Agent | C4 Level | Input Data Sources | Output | Key Function |
|-------|----------|-------------------|--------|--------------|
| **SystemContextResearcher** | C1 | ProjectStructure, CodeInsights, README | SystemContextReport | Identifies business value, stakeholders, external dependencies |
| **DomainModulesDetector** | C2 | SystemContextReport, CodeInsights | DomainModulesReport | DDD domain decomposition, bounded contexts |
| **ArchitectureResearcher** | C2 | DomainModulesReport, External Knowledge | ArchitectureReport | C4 diagrams, architectural patterns, drift detection |
| **WorkflowResearcher** | C2 | DomainModulesReport, CodeInsights | WorkflowReport | Business process flows, execution paths |
| **KeyModulesInsight** | C3-C4 | DomainModulesReport, filtered CodeInsights | Vec<KeyModuleReport> | Parallel technical deep-dive per domain |
| **BoundaryAnalyzer** | C3-C4 | CodeInsights (Entry/Api/Controller) | BoundaryAnalysisReport | CLI, API, Router interface extraction |
| **DatabaseOverviewAnalyzer** | C3-C4 | SQL files, DAO code | DatabaseOverviewReport | Schema, procedures, ER diagrams |

### 2.3 Static Code Analysis Workflow

This workflow transforms raw source code into structured `CodeInsight` objects through language-agnostic processing and AI enhancement.

#### 2.3.1 Two-Phase Analysis Architecture

```mermaid
flowchart LR
    subgraph Phase1["Phase 1: Static Analysis"]
        A[LanguageProcessor<br/>Trait Implementations] --> B[Dependency Extraction]
        A --> C[Interface Extraction]
        A --> D[Component Classification]
        A --> E[Documentation Parsing]
    end
    
    subgraph Phase2["Phase 2: AI Enhancement"]
        F[CodeAnalyze Agent] --> G[Responsibility Analysis]
        F --> H[Pattern Recognition]
        F --> I[Architectural Role]
    end
    
    Phase1 --> Phase2
    
    style Phase1 fill:#e3f2fd
    style Phase2 fill:#fff3e0
```

#### 2.3.2 Language Processing Matrix

The `LanguageProcessorManager` dispatches files based on extensions to specialized processors:

| Language/Framework | Processor | Key Capabilities |
|-------------------|-----------|------------------|
| Rust | `RustProcessor` | use/mod statements, unsafe blocks, trait/impl analysis |
| Java | `JavaProcessor` | Import/package extraction, Javadoc parsing, annotation detection |
| Python | `PythonProcessor` | Import analysis, docstring extraction, type annotation parsing |
| JavaScript/TypeScript | `TypeScriptProcessor` | ES6/CommonJS imports, JSDoc, async/await detection |
| C# | `CsharpProcessor` | Namespace/usings, XML documentation, attribute analysis |
| PHP | `PhpProcessor` | Namespace/use, DocBlock, Composer dependency detection |
| Swift | `SwiftProcessor` | Import attributes, optionals, generic type parsing |
| Kotlin | `KotlinProcessor` | Android-specific detection (Activity/ViewModel), coroutine support |
| React | `ReactProcessor` | Hook detection, JSX analysis, component hierarchy |
| Vue | `VueProcessor` | `<script>` extraction, Composition API detection |
| Svelte | `SvelteProcessor` | Reactive statements, store subscriptions |

#### 2.3.3 Parallel Processing Control

The `CodeAnalyze` agent implements controlled concurrency:

```rust
// Conceptual workflow based on code analysis
do_parallel_with_limit(
    files_to_analyze,
    max_parallels,  // From config.llm.max_concurrent_requests
    |file| async {
        // 1. Static analysis via LanguageProcessor
        let static_insight = processor.analyze(file);
        
        // 2. Build context-aware prompt
        let prompt = build_prompt(project_context, static_insight);
        
        // 3. LLM extraction
        let ai_insight = agent_executor::extract::<CodeInsight>(prompt).await;
        
        // 4. Merge static + AI results
        merge_insights(static_insight, ai_insight)
    }
).await;
```

**Token Management Strategy**:
- **Truncation**: Large files truncated to prevent prompt overflow (DatabaseOverviewAnalyzer limits to 50 most important files)
- **Compression**: `PromptCompressor` utility reduces content size while preserving semantic meaning
- **Filtering**: Importance score thresholding (0.6) filters low-value files from relationship analysis

### 2.4 Documentation Composition Workflow

Transforms structured research data into human-readable Markdown following C4 model standards.

#### 2.4.1 Editor Agent Pipeline

```mermaid
flowchart TD
    Start([Research Complete]) --> OE[OverviewEditor<br/>C4 System Context]
    OE --> AE[ArchitectureEditor<br/>C4 Container/Component]
    AE --> WE[WorkflowEditor<br/>This Document]
    WE --> KM[KeyModulesInsightEditor<br/>Concurrent Module Docs]
    KM --> BE[BoundaryEditor<br/>Interface Specs]
    
    BE --> DB{Database Files?}
    DB -->|Yes| DE[DatabaseEditor<br/>Schema Docs]
    DB -->|No| Persist
    DE --> Persist[Persist to Memory<br/>DOCUMENTATION Scope]
    
    Persist --> End([DocTree Updated])
    
    style OE fill:#e3f2fd
    style AE fill:#e8f5e9
    style WE fill:#fff3e0
    style KM fill:#fce4ec
    style BE fill:#f3e5f5
```

#### 2.4.2 KeyModulesInsightEditor Concurrency Model

This editor implements a two-level architecture for parallel documentation generation:

1. **Orchestrator Level** (`KeyModulesInsightEditor` plural): 
   - Retrieves all `KeyModuleReport` objects from research memory
   - Creates concurrent tasks for each module
   - Uses `do_parallel_with_limit` to control LLM concurrency

2. **Agent Level** (`KeyModuleInsightEditor` singular):
   - Implements `StepForwardAgent` trait
   - Generates individual module documentation
   - Updates `DocTree` with localized file paths

---

## 3. Flow Coordination and Control

### 3.1 Centralized Context Management

The `GeneratorContext` serves as a dependency container and resource manager, implementing the **Context Pattern** to avoid parameter proliferation across async boundaries.

```mermaid
classDiagram
    class GeneratorContext {
        +Arc~RwLock~Config~~ config
        +Arc~RwLock~Box~dyn ProviderClient~~ llm_client
        +Arc~RwLock~CacheManager~~ cache
        +Arc~RwLock~Memory~ memory
        +store_to_memory()
        +get_from_memory()
        +load_external_knowledge()
    }
    
    class Memory {
        +PREPROCESSING
        +STUDIES_RESEARCH
        +DOCUMENTATION
        +store()
        +retrieve()
    }
    
    GeneratorContext --> Memory
    GeneratorContext --> CacheManager
    GeneratorContext --> ProviderClient
```

**Thread Safety Model**:
- All mutable shared state protected by `Arc<RwLock<T>>`
- Lock granularity at the service level (coarse-grained) rather than operation level
- Async-aware locking via `tokio::sync::RwLock`

### 3.2 Memory Scopes and Data Sharing

```mermaid
flowchart LR
    subgraph Preprocess["PREPROCESSING Scope"]
        P1[CodeInsights]
        P2[ProjectStructure]
        P3[RelationshipAnalysis]
        P4[README Content]
    end
    
    subgraph Research["STUDIES_RESEARCH Scope"]
        R1[SystemContextReport]
        R2[DomainModulesReport]
        R3[ArchitectureReport]
        R4[WorkflowReport]
        R5[KeyModuleReports]
        R6[BoundaryAnalysis]
        R7[DatabaseOverview]
    end
    
    subgraph Compose["DOCUMENTATION Scope"]
        C1[Overview Markdown]
        C2[Architecture Markdown]
        C3[Workflow Markdown]
        C4[Module Markdowns]
        C5[Boundary Markdown]
        C6[Database Markdown]
    end
    
    Preprocess --> Research --> Compose
    
    style Preprocess fill:#e3f2fd
    style Research fill:#e8f5e9
    style Compose fill:#fff3e0
```

**Data Retrieval Patterns**:
- **Required Data**: Hard dependencies validated by `AgentDataConfig` (e.g., DomainModulesDetector requires SystemContextResearcher output)
- **Optional Data**: Soft dependencies via `ExternalKnowledgeByCategory` (e.g., Architecture documents from knowledge base)
- **Source Data**: Raw code insights filtered by path patterns for domain-specific analysis

### 3.3 Execution Scheduling and Parallelism

The system implements three concurrency strategies:

1. **Pipeline Parallelism**: Stages execute sequentially (Preprocess → Research → Compose), but data flows asynchronously between them
2. **Data Parallelism**: Independent agents execute concurrently (ArchitectureResearcher || WorkflowResearcher || BoundaryAnalyzer)
3. **Task Parallelism**: File-level parallel processing within agents (CodeAnalyze, KeyModulesInsight)

**Resource Limits**:
- `max_parallels` from LLM configuration controls concurrent API calls
- `do_parallel_with_limit` utility provides backpressure and prevents resource exhaustion

### 3.4 State Transitions and Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Initialization: CLI Args Parsed
    Initialization --> Preprocessing: launch()
    
    Preprocessing --> Research: PreProcessAgent Complete
    Research --> Composition: ResearchOrchestrator Complete
    Composition --> Output: DocumentationComposer Complete
    Output --> [*]: SummaryOutlet Complete
    
    state Preprocessing {
        [*] --> READMEExtract
        READMEExtract --> StructureAnalysis
        StructureAnalysis --> FileIdentification
        FileIdentification --> LanguageProcessing
        LanguageProcessing --> AIAnalysis
        AIAnalysis --> RelationshipAnalysis
        RelationshipAnalysis --> [*]
    }
    
    state Research {
        [*] --> SystemContext
        SystemContext --> DomainDetection
        DomainDetection --> ParallelResearch
        ParallelResearch --> [*]
        
        state ParallelResearch {
            [*] --> Architecture
            [*] --> Workflow
            [*] --> KeyModules
            [*] --> Boundaries
            [*] --> Database
        }
    }
    
    note right of Initialization
        GeneratorContext initialized
        with LLM client, cache, memory
    end note
```

---

## 4. Exception Handling and Recovery

### 4.1 Error Handling Architecture

The system employs **anyhow** for ergonomic error propagation with context, implementing a **Fail-Fast** strategy for critical errors and **Graceful Degradation** for non-critical components.

#### 4.1.1 Error Classification and Strategy

| Error Category | Strategy | Implementation |
|---------------|----------|----------------|
| **Configuration Errors** | Fail Fast | `?` propagation in `main.rs`, user-friendly messages via i18n |
| **LLM API Errors** | Retry with Exponential Backoff | Built into `ProviderClient` with configurable retries |
| **Cache Errors** | Degrade to Direct Execution | Cache errors logged but don't block execution |
| **Agent Analysis Errors** | Partial Success | `KeyModulesInsight` continues if individual domain fails |
| **File System Errors** | Fail Fast | `DiskOutlet` propagates IO errors immediately |

### 4.2 Resilience Patterns

#### 4.2.1 Circuit Breaker (Implicit)
The caching layer acts as a circuit breaker for LLM calls:
- Cache hit → Skip API call (reduces load during outages)
- Cache miss → Attempt API call with timeout
- API failure → Propagate error (no fallback to stale data for analysis)

#### 4.2.2 Graceful Degradation in Parallel Processing

```rust
// From KeyModulesInsight analysis
let results: Vec<_> = futures::future::join_all(tasks).await
    .into_iter()
    .filter_map(|result| {
        match result {
            Ok(report) => Some(report),
            Err(e) => {
                log::warn!("Domain analysis failed: {}", e);
                None // Continue with partial results
            }
        }
    })
    .collect();
```

#### 4.2.3 Token Overflow Protection

The `DatabaseOverviewAnalyzer` implements defensive truncation:
1. Filters to top 50 most important SQL files
2. Applies truncation to individual large files
3. Uses `PromptCompressor` for emergency content reduction
4. Falls back to directory-only view if compression fails

### 4.3 Recovery Mechanisms

**Partial Pipeline Recovery**:
- Preprocessing results cached in memory; research can resume without re-preprocessing if previous stage intact
- Research reports persisted to `STUDIES_RESEARCH` scope; composition can re-run without re-research
- Generated documents in `DOCUMENTATION` scope allow re-output without regeneration

**Mermaid Diagram Recovery**:
- `MermaidFixer` post-processes generated diagrams using external tool
- Syntax errors fixed automatically without user intervention
- If fixing fails, original content preserved with warning logs

### 4.4 Monitoring and Observability

**Performance Monitoring**:
- `CachePerformanceMonitor` tracks hit rates, cost savings, and inference time saved
- `TimingScope` records duration for each pipeline phase
- Summary reports include efficiency ratios (cost-per-second, improvement multipliers)

**Error Context**:
- All errors include the agent type and operation context via `anyhow::Context`
- Bilingual logging (English/Chinese) via `TargetLanguage` ensures accessibility

---

## 5. Key Process Implementation

### 5.1 StepForwardAgent Framework

The `StepForwardAgent` trait provides a **Template Method Pattern** for standardized agent execution, ensuring consistent data validation, prompt construction, and result storage.

```mermaid
classDiagram
    class StepForwardAgent {
        +execute()
        +data_config()
        +memory_scope()
        +prompt_template()
        +post_process()
    }
    
    class SystemContextResearcher {
        +Output: SystemContextReport
    }
    
    class DomainModulesDetector {
        +Output: DomainModulesReport
    }
    
    StepForwardAgent <|-- SystemContextResearcher
    StepForwardAgent <|-- DomainModulesDetector
    StepForwardAgent <|-- ArchitectureResearcher
    StepForwardAgent <|-- WorkflowResearcher
```

**Lifecycle Hooks**:
1. **Data Configuration**: Declares required/optional data sources via `AgentDataConfig`
2. **Data Collection**: Retrieves from memory scopes or external knowledge
3. **Content Formatting**: `DataFormatter` applies hierarchical formatting and compression
4. **Prompt Engineering**: `GeneratorPromptBuilder` constructs multilingual prompts
5. **LLM Invocation**: Supports `Extract` (JSON), `Prompt` (Free text), `PromptWithTools` (ReAct)
6. **Post-Processing**: Validation, side effects, and localized logging

### 5.2 Language Processor Architecture

The `LanguageProcessor` trait implements the **Strategy Pattern** for polyglot code analysis:

```rust
// Conceptual trait structure
trait LanguageProcessor {
    fn extensions(&self) -> &[&str];
    fn extract_dependencies(&self, content: &str) -> Vec<Dependency>;
    fn extract_interfaces(&self, content: &str) -> Vec<InterfaceInfo>;
    fn determine_type(&self, file: &FileInfo) -> ComponentType;
    fn calculate_complexity(&self, content: &str) -> ComplexityMetrics;
}
```

**Regex-Based Parsing Strategy**:
- Uses pre-compiled regex patterns for performance (avoids AST parsing overhead)
- Handles language-specific constructs (Swift optionals, Python decorators, Rust lifetimes)
- Falls back to generic patterns for unsupported constructs
- Trade-off: Speed vs. perfect accuracy for complex nested generics

### 5.3 Caching Strategy (Cache-Aside Pattern)

```mermaid
sequenceDiagram
    participant Client as AgentExecutor
    participant Cache as CacheManager
    participant LLM as LLM Provider
    
    Client->>Cache: get(key)
    alt Cache Hit
        Cache-->>Client: Cached Response
    else Cache Miss
        Client->>LLM: API Request
        LLM-->>Client: Response + Token Usage
        Client->>Cache: set(key, response, tokens)
        Cache-->>Client: Ack
    end
```

**Cache Key Generation**:
- Composite key: MD5 hash of `system_prompt + user_prompt + operation_type`
- Ensures isolation between different execution modes (Extract vs Prompt)
- Category-based directory structure for cache organization

**Token Preservation**:
- Cache entries store input/output token counts
- Enables accurate cost analysis even on cache hits
- Supports cache performance monitoring metrics

### 5.4 Prompt Engineering and Compression

**Template Structure**:
1. **System Instructions**: Role definition and output constraints
2. **Opening Instruction**: Task framing and context setting
3. **Research Materials**: Structured data from previous stages
4. **Closing Instruction**: Specific analysis requirements and format enforcement

**Compression Algorithm**:
```rust
// Hierarchical compression strategy
1. Content sorting by importance score (descending)
2. Truncation of low-importance items
3. Semantic compression (removing boilerplate, preserving semantics)
4. Emergency truncation (hard character limits)
```

**Multilingual Support**:
- `TargetLanguage` enum injection into prompts
- 8 supported languages with native script display names
- AI instruction templates guide LLM to generate documentation in target language

### 5.5 Internationalization (i18n) Integration

The system maintains **locale awareness** throughout the workflow:

- **File System**: Localized directory names (`1.Overview` vs `1、项目概述`)
- **Console Output**: 16+ message templates translated across 8 languages
- **Documentation Content**: LLM prompted to generate in target language
- **Error Messages**: Bilingual error context for debugging

**Implementation**:
- Exhaustive match expressions on `TargetLanguage` enum (8 variants)
- High cyclomatic complexity (54) accepted for compile-time exhaustiveness guarantees
- Emoji indicators provide visual feedback regardless of language

---

## Appendix: Workflow Performance Characteristics

| Workflow Phase | Time Complexity | Bottleneck | Optimization Strategy |
|---------------|----------------|------------|---------------------|
| **Preprocessing** | O(n) where n=files | AI Code Analysis | Parallel processing with `max_parallels` limit |
| **Research** | O(d) where d=domains | KeyModulesInsight (parallel per domain) | Concurrent agent execution |
| **Composition** | O(s) where s=sections | Sequential editor dependency | N/A (inherently sequential) |
| **Output** | O(docs) | MermaidFixer external tool | Async I/O with tokio |
| **Overall** | O(n + d + s) | LLM API latency | Caching, retry logic, token optimization |

**Resource Utilization**:
- **CPU**: High during regex parsing and file I/O
- **Network**: LLM API calls with configurable concurrency
- **Memory**: Scales with project size (CodeInsights stored in memory)
- **Disk**: Cache storage proportional to unique prompt count