escher-execution-engine 0.1.2

# Storage Strategy for Execution Results

This document explains the **actual use case** for execution result storage in the Escher architecture and what storage mechanisms are needed.

> **📦 PLAYBOOK STRUCTURE (v2.0.0)**: As of v2.0.0, playbooks use a **single-file structure** (playbook.json) instead of 3 separate files. The playbook examples in this document show the simplified structure. The Execution Engine doesn't parse playbook files directly - it receives individual commands from the Execution Manager (TypeScript layer) which handles playbook orchestration. See [v2-architecture-docs](../../../v2-architecture-docs/working-docs/PLAYBOOK-STRUCTURE-MIGRATION-IMPACT.md) for details.

---

## The Use Case: Server Playbook Execution

### Real-World Flow

```
1. User asks: "List all EC2 instances in production"
   └─ Typed in chat UI

2. Client → Server (HTTPS request):
   ├─ Enriched query (NO credentials)
   ├─ Context: User has AWS "production" account
   └─ IAM permissions: ec2:DescribeInstances allowed

3. Server → Client (HTTPS response):
   ├─ Playbook: Execute AWS CLI command
   └─ Ready-to-run script with parameters

4. Client executes LOCALLY:
   ├─ Retrieves credentials from OS Keychain
   ├─ Runs: aws ec2 describe-instances --profile production
   ├─ Captures stdout/stderr in real-time
   └─ Streams output to UI via Tauri events

5. Client reports result:
   ├─ If SUCCESS: Show results in UI ✅
   └─ If FAILURE: Show error in UI + store for audit ❌

6. Client stores execution record:
   ├─ In local Qdrant database (encrypted)
   ├─ In temp log files (for debugging)
   └─ Backed up to user's S3 (NOT Escher servers)

7. NOTHING sent back to server about results
   └─ Server remains 100% stateless
```

---

## Critical Architecture Principle: Privacy-First

### ❌ What NEVER Goes to Server

1. **Credentials** - Stay in OS Keychain
2. **Execution output** - Stay on client device
3. **Error logs** - Stay on client device
4. **Cloud estate data** - Stay in local Qdrant
5. **Chat history** - Stay in local Qdrant (encrypted)

### ✅ What Does Go to Server

1. **Enriched query** (no secrets): "List EC2 in production account"
2. **Context metadata**: User has access to production, allowed operations
3. **IAM permission summary**: Can read EC2, cannot write
4. **Response**: Ready-to-execute script/playbook

**Server Response Example**:
```json
{
  "playbook_id": "uuid",
  "steps": [
    {
      "step_id": "1",
      "command": "aws ec2 describe-instances --profile {profile} --region {region}",
      "parameters": {
        "profile": "production",
        "region": "us-east-1"
      },
      "expected_duration_sec": 5,
      "retry_policy": "exponential_backoff"
    }
  ]
}
```

---

## Storage Requirements: What We Actually Need

### 1. In-Memory Storage (During Execution)

**Purpose**: Real-time streaming to UI

**Location**: `ExecutionState` in `ExecutionEngine`

```rust
pub struct ExecutionState {
    pub stdout: String,  // Accumulate for UI display
    pub stderr: String,  // Accumulate for error display
    // ... other fields
}
```

**Lifetime**: Until execution completes + retention period

**Use Case**:
- Stream output to UI in real-time
- Show progress to user
- Final display when complete

**Current Status**: ✅ Implemented correctly

---

### 2. Log Files (Temporary Debugging)

**Purpose**: Developer debugging, troubleshooting failed executions

**Location**: Temp folder `/tmp/cloudops-executions/{execution_id}.log`

**Content**: Complete stdout + stderr + metadata

**Lifetime**: OS cleans temp folder (7-30 days typical)

**Use Case**:
- User reports issue: "My deploy failed"
- Support can ask: "Send me the log file from /tmp/..."
- Developer can inspect full output

**Current Status**: ✅ Partially implemented (writes logs after completion)

---

### 3. Local Database (Persistent Audit Trail)

**Purpose**: Long-term audit, compliance, user history

**Location**: Local Qdrant vector database

**Collections**:

#### A. `chat_history` Collection
```rust
{
  "id": "uuid",
  "user_id": "user123",
  "timestamp": "2025-10-21T10:30:00Z",
  "query": "List EC2 instances in production",
  "response": "Found 15 instances...",
  "execution_ids": ["exec-uuid-1", "exec-uuid-2"],  // Links to executions
  "encrypted": true
}
```

#### B. `executed_operations` Collection
```rust
{
  "id": "execution-uuid",
  "conversation_id": "chat-uuid",  // Links back to chat
  "command": "aws ec2 describe-instances",
  "command_type": "aws_cli",
  "status": "completed",  // or "failed", "timeout", "cancelled"
  "started_at": "2025-10-21T10:30:05Z",
  "completed_at": "2025-10-21T10:30:12Z",
  "duration_sec": 7,
  "exit_code": 0,

  // Minimal output (NOT full logs)
  "output_summary": "15 instances found",  // First 500 chars
  "error_summary": null,  // First 500 chars if failed

  // Full output stored separately if needed
  "full_output_path": "/tmp/cloudops-executions/exec-uuid.log",

  "encrypted": true
}
```

**Retention**: User-configurable (default: 90 days)

**Encryption**: AES-256-GCM

**Use Case**:
- User browses execution history: "Show me all deploys from last week"
- Audit compliance: "Prove what operations were run on production"
- Cost tracking: "How many AWS API calls did I make?"

**Current Status**: ❌ Not implemented (requires Storage Service integration)

---

### 4. User's Cloud Storage (Backup)

**Purpose**: Long-term backup, disaster recovery

**Location**: User's own S3/Azure Blob/GCS bucket (NOT Escher servers)

**Content**: Encrypted Qdrant backup + execution logs

**Frequency**:
- Local: 7 days retention
- Cloud: 30 days retention

**Use Case**:
- User reinstalls app → restore from S3
- User switches devices → sync from S3
- Audit requirement: "Keep logs for 1 year"

**Current Status**: ❌ Not implemented (requires Storage Service integration)

---

## Reporting Back to Server: What Gets Sent?

### Answer: NOTHING ❌

**Server is 100% stateless**. It never receives:
- Execution results
- Success/failure status
- Output logs
- Error messages

### Why?

**Privacy Architecture**:
- Server cannot access user data
- Server cannot access cloud credentials
- Server cannot access execution logs
- All sensitive data stays on client device

**Client handles everything**:
- Execute → Display in UI → Store locally → Done
- No callback to server required

---

## But What About Playbook Steps?

### Multi-Step Playbook Example

Server sends:
```json
{
  "playbook_id": "deploy-web-app",
  "steps": [
    {"step": 1, "command": "Build Docker image"},
    {"step": 2, "command": "Push to ECR"},
    {"step": 3, "command": "Update ECS service"},
    {"step": 4, "command": "Wait for deployment"},
    {"step": 5, "command": "Run health check"}
  ]
}
```

Client execution:
```
Step 1: Build Docker image
  ├─ stdout: Building... [2000 lines]
  ├─ Status: ✅ Success
  └─ Continue to Step 2

Step 2: Push to ECR
  ├─ stdout: Pushing... [500 lines]
  ├─ Status: ✅ Success
  └─ Continue to Step 3

Step 3: Update ECS service
  ├─ stdout: Updating... [200 lines]
  ├─ Status: ❌ Failed
  ├─ stderr: "Access denied: Missing ecs:UpdateService permission"
  └─ STOP - Report to UI

User sees in UI:
  ✅ Step 1: Build Docker image (12s)
  ✅ Step 2: Push to ECR (45s)
  ❌ Step 3: Update ECS service (2s)
      Error: Access denied - Missing ecs:UpdateService permission
  ⏸️  Step 4: Skipped
  ⏸️  Step 5: Skipped
```

**Stored locally**:
- Each step's execution record in `executed_operations`
- Full output in temp log files
- Linked to conversation in `chat_history`

**NOT sent to server**: None of this goes back to server

---

## Storage Strategy: What Execution Engine Needs

### Current Implementation (In-Memory Only)

```rust
// ExecutionEngine stores in memory
pub struct ExecutionEngine {
    executions: Arc<RwLock<HashMap<Uuid, Arc<RwLock<ExecutionState>>>>>,
    // ↑ In-memory only, lost on restart
}

// ExecutionState accumulates output
pub struct ExecutionState {
    pub stdout: String,  // Grows unbounded ⚠️
    pub stderr: String,  // Grows unbounded ⚠️
}
```

**Problems**:
1. ❌ No size limiting → OOM risk
2. ❌ No persistence → Lost on crash/restart
3. ❌ No cleanup → Memory leak
4. ✅ Works for real-time streaming

---

### Required Implementation

#### Phase 1: Keep Current + Add Protections

**Immediate needs** (this crate):
1. ✅ In-memory storage for active executions (EXISTS)
2. ✅ Log file writing (EXISTS, but improve)
3. ❌ Output size limiting (NEEDED)
4. ❌ Memory cleanup task (NEEDED)

#### Phase 2: Integration with Storage Service

**Future integration** (different crate):
1. ❌ Write to local Qdrant after completion
2. ❌ Encrypt with AES-256-GCM
3. ❌ Backup to user's S3/Blob/GCS
4. ❌ Query/retrieve from Qdrant

**Not this crate's responsibility** - handled by `storage-service` crate

---

## What Execution Engine Should Do

### During Execution

```rust
// 1. Stream stdout/stderr line-by-line
while let Some(line) = stdout.next_line().await? {
    // Emit event (for UI)
    handler.handle_event(ExecutionEvent::Stdout {
        execution_id,
        line
    }).await;

    // Accumulate in memory (with size limit!)
    let mut state = state.write().await;
    if state.stdout.len() + line.len() > MAX_OUTPUT_SIZE {
        // Apply oversized strategy
        match strategy {
            TruncateWithWarning => {
                state.stdout.push_str("\n[Output truncated at 10MB]");
                break;
            }
            FailExecution => {
                return Err(ExecutionError::OutputSizeExceeded);
            }
            StreamToFile => {
                // Switch to file streaming
                write_to_file(&line).await?;
            }
        }
    }
    state.stdout.push_str(&line);
}
```

### After Completion

```rust
// 1. Write full log to temp file
let log_path = format!("/tmp/cloudops-executions/{}.log", execution_id);
tokio::fs::write(&log_path, format!(
    "Exit Code: {}\n\nSTDOUT:\n{}\n\nSTDERR:\n{}",
    exit_code, stdout, stderr
)).await?;

// 2. Create ExecutionResult (lightweight)
let result = ExecutionResult {
    id: execution_id,
    status: final_status,
    exit_code,
    stdout: stdout.clone(),  // Full output (or summary if truncated)
    stderr: stderr.clone(),
    duration,
    started_at,
    completed_at: Some(Utc::now()),
    error: error_message,
};

// 3. Keep in memory for retention period
// (Cleanup task will remove after config.execution_retention_secs)

// 4. Return result to caller
Ok(result)
```

### What Caller (Tauri) Does

```rust
// In Tauri command handler
#[tauri::command]
async fn execute_command(
    request: ExecutionRequest,
    engine: State<'_, Arc<ExecutionEngine>>,
    storage: State<'_, Arc<StorageService>>,  // Different crate!
) -> Result<String, String> {
    // 1. Execute command
    let execution_id = engine.execute(request).await?;

    // 2. Wait for completion
    let result = engine.wait_for_completion(execution_id).await?;

    // 3. Store in Qdrant (Storage Service responsibility)
    storage.store_execution_result(&result).await?;

    // 4. Return execution ID
    Ok(execution_id.to_string())
}
```

---

## Summary: What This Crate Needs

### ✅ Already Implemented

1. **In-memory state management** - Works
2. **Real-time streaming** - Works
3. **Event emission** - Works
4. **Log file writing** - Works (but can improve)
5. **Concurrent execution** - Works

### ❌ Missing (HIGH PRIORITY)

1. **Output size limiting** - Prevents OOM
   - Check size during streaming
   - Apply `OversizedOutputStrategy`
   - Truncate or fail gracefully

2. **Memory cleanup task** - Prevents memory leak
   - Remove old executions after retention period
   - Remove oldest if count exceeds limit
   - Run every 5 minutes

3. **Better log file management**
   - Write incrementally (not just at end)
   - Rotation if too large
   - Better error handling

### ❌ Not This Crate's Responsibility

1. **Qdrant storage** - Storage Service handles this
2. **Encryption** - Storage Service handles this
3. **S3 backup** - Storage Service handles this
4. **Query/retrieve** - Storage Service handles this

---

## Answering Your Question

> "but we need to store in file... for future reference or for sending it back to server"

### Answer:

**1. File Storage - YES** ✅
- We DO need log files
- For debugging and troubleshooting
- Already implemented (can improve)
- Location: `/tmp/cloudops-executions/{execution_id}.log`

**2. Sending Back to Server - NO** ❌
- We DO NOT send results to server
- Server is 100% stateless (privacy architecture)
- Everything stays on client device

**3. Success/Failure Reporting**
- Client displays in UI immediately
- Client stores in local Qdrant
- User can browse history locally
- Nothing transmitted to server

**4. Use Case Summary**
```
Server sends: "Execute this playbook"
Client executes: Locally with local credentials
Client displays: Results in UI
Client stores: In local Qdrant + log files
Client sends back: NOTHING
```

---

## Next Steps

### Priority 1: Add Output Size Limiting
```rust
// Add to executor.rs streaming functions
if state.stdout.len() > config.max_output_size_bytes {
    apply_oversized_strategy();
}
```

### Priority 2: Add Cleanup Task
```rust
// Add to engine.rs
pub fn start_cleanup_task(&self) {
    tokio::spawn(async move {
        loop {
            tokio::time::sleep(Duration::from_secs(300)).await;
            self.cleanup_old_executions().await;
        }
    });
}
```

### Priority 3: Improve Log Files
```rust
// Write incrementally during execution
// Better structure: JSON or structured format
// Include metadata: started_at, command, etc.
```

### Future: Integration with Storage Service
- Will be handled by different crate
- This crate just provides execution + log files
- Storage Service reads logs and stores in Qdrant