# Storage Strategy for Execution Results
This document explains the **actual use case** for execution result storage in the Escher architecture and what storage mechanisms are needed.
> **📦 PLAYBOOK STRUCTURE (v2.0.0)**: As of v2.0.0, playbooks use a **single-file structure** (`playbook.json`) instead of 3 separate files. The playbook examples in this document show the simplified structure. The Execution Engine doesn't parse playbook files directly; it receives individual commands from the Execution Manager (TypeScript layer), which handles playbook orchestration. See [v2-architecture-docs](../../../v2-architecture-docs/working-docs/PLAYBOOK-STRUCTURE-MIGRATION-IMPACT.md) for details.
---
## The Use Case: Server Playbook Execution
### Real-World Flow
```
1. User asks: "List all EC2 instances in production"
   └─ Typed in chat UI
2. Client → Server (HTTPS request):
   ├─ Enriched query (NO credentials)
   ├─ Context: User has AWS "production" account
   └─ IAM permissions: ec2:DescribeInstances allowed
3. Server → Client (HTTPS response):
   ├─ Playbook: Execute AWS CLI command
   └─ Ready-to-run script with parameters
4. Client executes LOCALLY:
   ├─ Retrieves credentials from OS Keychain
   ├─ Runs: aws ec2 describe-instances --profile production
   ├─ Captures stdout/stderr in real-time
   └─ Streams output to UI via Tauri events
5. Client reports result:
   ├─ If SUCCESS: Show results in UI ✅
   └─ If FAILURE: Show error in UI + store for audit ❌
6. Client stores execution record:
   ├─ In local Qdrant database (encrypted)
   ├─ In temp log files (for debugging)
   └─ Backed up to user's S3 (NOT Escher servers)
7. NOTHING sent back to server about results
   └─ Server remains 100% stateless
```
---
## Critical Architecture Principle: Privacy-First
### ❌ What NEVER Goes to Server
1. **Credentials** - Stay in OS Keychain
2. **Execution output** - Stay on client device
3. **Error logs** - Stay on client device
4. **Cloud estate data** - Stay in local Qdrant
5. **Chat history** - Stay in local Qdrant (encrypted)
### ✅ What Does Go to Server
1. **Enriched query** (no secrets): "List EC2 in production account"
2. **Context metadata**: User has access to production, allowed operations
3. **IAM permission summary**: Can read EC2, cannot write
4. **Response**: Ready-to-execute script/playbook
**Server Response Example**:
```json
{
  "playbook_id": "uuid",
  "steps": [
    {
      "step_id": "1",
      "command": "aws ec2 describe-instances --profile {profile} --region {region}",
      "parameters": {
        "profile": "production",
        "region": "us-east-1"
      },
      "expected_duration_sec": 5,
      "retry_policy": "exponential_backoff"
    }
  ]
}
```
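The `{profile}` and `{region}` placeholders in the response above have to be filled in on the client before the command runs. The following is a minimal sketch of that substitution, assuming simple `{name}` placeholders with no escaping; `render_command` is a hypothetical helper name, not part of the existing crate.

```rust
use std::collections::HashMap;

/// Replaces every `{name}` placeholder in `template` with its value from `params`.
/// Returns an error naming the first placeholder that has no matching parameter.
/// (Hypothetical helper: assumes placeholders are never nested or escaped.)
pub fn render_command(
    template: &str,
    params: &HashMap<String, String>,
) -> Result<String, String> {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(open) = rest.find('{') {
        // Copy the literal text before the placeholder.
        out.push_str(&rest[..open]);
        let after = &rest[open + 1..];
        let close = after
            .find('}')
            .ok_or_else(|| "unclosed '{' in command template".to_string())?;
        let name = &after[..close];
        let value = params
            .get(name)
            .ok_or_else(|| format!("missing parameter: {name}"))?;
        out.push_str(value);
        rest = &after[close + 1..];
    }
    // Copy whatever follows the last placeholder.
    out.push_str(rest);
    Ok(out)
}
```

Failing on a missing parameter keeps a half-rendered command from ever reaching the shell. In a real executor it is usually safer to pass the rendered arguments as an argv list rather than as one shell string, since parameter values would otherwise need quoting.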
---
## Storage Requirements: What We Actually Need
### 1. In-Memory Storage (During Execution)
**Purpose**: Real-time streaming to UI
**Location**: `ExecutionState` in `ExecutionEngine`
```rust
pub struct ExecutionState {
    pub stdout: String, // Accumulate for UI display
    pub stderr: String, // Accumulate for error display
    // ... other fields
}
```
**Lifetime**: Until execution completes + retention period
**Use Case**:
- Stream output to UI in real-time
- Show progress to user
- Final display when complete
**Current Status**: ✅ Implemented correctly
---
### 2. Log Files (Temporary Debugging)
**Purpose**: Developer debugging, troubleshooting failed executions
**Location**: Temp folder `/tmp/cloudops-executions/{execution_id}.log`
**Content**: Complete stdout + stderr + metadata
**Lifetime**: OS cleans temp folder (7-30 days typical)
**Use Case**:
- User reports issue: "My deploy failed"
- Support can ask: "Send me the log file from /tmp/..."
- Developer can inspect full output
**Current Status**: ✅ Partially implemented (writes logs after completion)
---
### 3. Local Database (Persistent Audit Trail)
**Purpose**: Long-term audit, compliance, user history
**Location**: Local Qdrant vector database
**Collections**:
#### A. `chat_history` Collection
```jsonc
{
  "id": "uuid",
  "user_id": "user123",
  "timestamp": "2025-10-21T10:30:00Z",
  "query": "List EC2 instances in production",
  "response": "Found 15 instances...",
  "execution_ids": ["exec-uuid-1", "exec-uuid-2"], // Links to executions
  "encrypted": true
}
```
#### B. `executed_operations` Collection
```jsonc
{
  "id": "execution-uuid",
  "conversation_id": "chat-uuid", // Links back to chat
  "command": "aws ec2 describe-instances",
  "command_type": "aws_cli",
  "status": "completed", // or "failed", "timeout", "cancelled"
  "started_at": "2025-10-21T10:30:05Z",
  "completed_at": "2025-10-21T10:30:12Z",
  "duration_sec": 7,
  "exit_code": 0,
  // Minimal output (NOT full logs)
  "output_summary": "15 instances found", // First 500 chars
  "error_summary": null, // First 500 chars if failed
  // Full output stored separately if needed
  "full_output_path": "/tmp/cloudops-executions/exec-uuid.log",
  "encrypted": true
}
```
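The `output_summary` and `error_summary` fields hold only the first 500 characters of output. A minimal sketch of how those summaries could be derived is shown below; the function names are hypothetical, and treating empty stderr as `null` is an assumption rather than documented behavior.

```rust
/// Maximum number of characters kept in `output_summary` / `error_summary`.
const SUMMARY_MAX_CHARS: usize = 500;

/// Returns at most `SUMMARY_MAX_CHARS` characters of `output`.
/// Counting `char`s (not bytes) avoids cutting through a multi-byte UTF-8 sequence.
pub fn summarize(output: &str) -> String {
    output.chars().take(SUMMARY_MAX_CHARS).collect()
}

/// Assumption: `error_summary` is `null` (None) when stderr is empty or whitespace.
pub fn summarize_error(stderr: &str) -> Option<String> {
    if stderr.trim().is_empty() {
        None
    } else {
        Some(summarize(stderr))
    }
}
```

Slicing by byte index (`&output[..500]`) would panic when byte 500 falls inside a multi-byte character, which is why the sketch iterates over characters instead.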
**Retention**: User-configurable (default: 90 days)
**Encryption**: AES-256-GCM
**Use Case**:
- User browses execution history: "Show me all deploys from last week"
- Audit compliance: "Prove what operations were run on production"
- Cost tracking: "How many AWS API calls did I make?"
**Current Status**: ❌ Not implemented (requires Storage Service integration)
---
### 4. User's Cloud Storage (Backup)
**Purpose**: Long-term backup, disaster recovery
**Location**: User's own S3/Azure Blob/GCS bucket (NOT Escher servers)
**Content**: Encrypted Qdrant backup + execution logs
**Backup retention**:
- Local: 7 days retention
- Cloud: 30 days retention
**Use Case**:
- User reinstalls app → restore from S3
- User switches devices → sync from S3
- Audit requirement: "Keep logs for 1 year"
**Current Status**: ❌ Not implemented (requires Storage Service integration)
---
## Reporting Back to Server: What Gets Sent?
### Answer: NOTHING ❌
**Server is 100% stateless**. It never receives:
- Execution results
- Success/failure status
- Output logs
- Error messages
### Why?
**Privacy Architecture**:
- Server cannot access user data
- Server cannot access cloud credentials
- Server cannot access execution logs
- All sensitive data stays on client device
**Client handles everything**:
- Execute → Display in UI → Store locally → Done
- No callback to server required
---
## But What About Playbook Steps?
### Multi-Step Playbook Example
Server sends:
```json
{
  "playbook_id": "deploy-web-app",
  "steps": [
    {"step": 1, "command": "Build Docker image"},
    {"step": 2, "command": "Push to ECR"},
    {"step": 3, "command": "Update ECS service"},
    {"step": 4, "command": "Wait for deployment"},
    {"step": 5, "command": "Run health check"}
  ]
}
```
Client execution:
```
Step 1: Build Docker image
├─ stdout: Building... [2000 lines]
├─ Status: ✅ Success
└─ Continue to Step 2
Step 2: Push to ECR
├─ stdout: Pushing... [500 lines]
├─ Status: ✅ Success
└─ Continue to Step 3
Step 3: Update ECS service
├─ stdout: Updating... [200 lines]
├─ Status: ❌ Failed
├─ stderr: "Access denied: Missing ecs:UpdateService permission"
└─ STOP - Report to UI
User sees in UI:
✅ Step 1: Build Docker image (12s)
✅ Step 2: Push to ECR (45s)
❌ Step 3: Update ECS service (2s)
   Error: Access denied - Missing ecs:UpdateService permission
⏸️ Step 4: Skipped
⏸️ Step 5: Skipped
```
**Stored locally**:
- Each step's execution record in `executed_operations`
- Full output in temp log files
- Linked to conversation in `chat_history`
**NOT sent to server**: None of this goes back to server
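The stop-on-first-failure behavior in the example above (later steps reported as skipped) can be sketched as follows. This is an illustrative sketch only: in the real architecture the orchestration lives in the TypeScript Execution Manager, and `StepStatus` and `run_playbook` are hypothetical names.

```rust
/// Final state of one playbook step as shown in the UI.
#[derive(Debug, Clone, PartialEq)]
pub enum StepStatus {
    Success,
    Failed(String), // error message taken from stderr
    Skipped,        // never executed because an earlier step failed
}

/// Runs steps in order. After the first failure, the remaining steps are
/// not executed and are reported as `Skipped`.
pub fn run_playbook<F>(steps: &[&str], mut run_step: F) -> Vec<(String, StepStatus)>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut report = Vec::with_capacity(steps.len());
    let mut halted = false;
    for &step in steps {
        let status = if halted {
            StepStatus::Skipped
        } else {
            match run_step(step) {
                Ok(()) => StepStatus::Success,
                Err(message) => {
                    halted = true;
                    StepStatus::Failed(message)
                }
            }
        };
        report.push((step.to_string(), status));
    }
    report
}
```

Returning one entry per step, including the skipped ones, gives the UI everything it needs to render the full checklist from a single report.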
---
## Storage Strategy: What Execution Engine Needs
### Current Implementation (In-Memory Only)
```rust
// ExecutionEngine stores in memory
pub struct ExecutionEngine {
    executions: Arc<RwLock<HashMap<Uuid, Arc<RwLock<ExecutionState>>>>>,
    // ↑ In-memory only, lost on restart
}
// ExecutionState accumulates output
pub struct ExecutionState {
    pub stdout: String, // Grows unbounded ⚠️
    pub stderr: String, // Grows unbounded ⚠️
}
```
**Problems**:
1. ❌ No size limiting → OOM risk
2. ❌ No persistence → Lost on crash/restart
3. ❌ No cleanup → Memory leak
4. ✅ Works for real-time streaming
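The first problem (unbounded growth) can be addressed by wrapping the accumulated output in a size-capped buffer. The sketch below shows one possible shape, corresponding to a truncate-with-warning strategy; `BoundedOutput` and `Append` are hypothetical names, and the real limit would come from the engine's configuration.

```rust
/// Outcome of appending one line to a bounded buffer.
#[derive(Debug, PartialEq)]
pub enum Append {
    Stored,
    Truncated, // limit reached: this line and all later lines are dropped
}

/// Accumulates output up to `max_bytes`, then appends a single truncation notice.
pub struct BoundedOutput {
    buf: String,
    max_bytes: usize,
    truncated: bool,
}

impl BoundedOutput {
    pub fn new(max_bytes: usize) -> Self {
        Self { buf: String::new(), max_bytes, truncated: false }
    }

    pub fn push_line(&mut self, line: &str) -> Append {
        if self.truncated {
            return Append::Truncated;
        }
        // +1 accounts for the newline re-added after each line.
        if self.buf.len() + line.len() + 1 > self.max_bytes {
            self.truncated = true;
            // The notice itself is allowed to exceed the limit by a few bytes.
            self.buf.push_str("[Output truncated]\n");
            return Append::Truncated;
        }
        self.buf.push_str(line);
        self.buf.push('\n');
        Append::Stored
    }

    pub fn as_str(&self) -> &str {
        &self.buf
    }

    pub fn is_truncated(&self) -> bool {
        self.truncated
    }
}
```

Returning `Append::Truncated` instead of stopping lets the caller keep draining the child's stdout after the limit is hit, so the process is not blocked on a full pipe while its extra output is discarded.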
---
### Required Implementation
#### Phase 1: Keep Current + Add Protections
**Immediate needs** (this crate):
1. ✅ In-memory storage for active executions (EXISTS)
2. ✅ Log file writing (EXISTS, but improve)
3. ❌ Output size limiting (NEEDED)
4. ❌ Memory cleanup task (NEEDED)
#### Phase 2: Integration with Storage Service
**Future integration** (different crate):
1. ❌ Write to local Qdrant after completion
2. ❌ Encrypt with AES-256-GCM
3. ❌ Backup to user's S3/Blob/GCS
4. ❌ Query/retrieve from Qdrant
**Not this crate's responsibility** - handled by `storage-service` crate
---
## What Execution Engine Should Do
### During Execution
```rust
// 1. Stream stdout/stderr line-by-line
while let Some(line) = stdout.next_line().await? {
    // Emit event (for UI); clone because `line` is still needed below
    handler.handle_event(ExecutionEvent::Stdout {
        execution_id,
        line: line.clone(),
    }).await;
    // Accumulate in memory (with size limit!)
    let mut state = state.write().await;
    if state.stdout.len() + line.len() > MAX_OUTPUT_SIZE {
        // Apply oversized strategy
        match strategy {
            OversizedOutputStrategy::TruncateWithWarning => {
                state.stdout.push_str("\n[Output truncated at 10MB]");
                break;
            }
            OversizedOutputStrategy::FailExecution => {
                return Err(ExecutionError::OutputSizeExceeded);
            }
            OversizedOutputStrategy::StreamToFile => {
                // Switch to file streaming and skip the in-memory append below
                write_to_file(&line).await?;
                continue;
            }
        }
    }
    state.stdout.push_str(&line);
    state.stdout.push('\n'); // `next_line()` strips the line terminator
}
```
### After Completion
```rust
// 1. Write full log to temp file
let log_dir = "/tmp/cloudops-executions";
tokio::fs::create_dir_all(log_dir).await?; // the directory may not exist yet
let log_path = format!("{}/{}.log", log_dir, execution_id);
tokio::fs::write(&log_path, format!(
    "Exit Code: {}\n\nSTDOUT:\n{}\n\nSTDERR:\n{}",
    exit_code, stdout, stderr
)).await?;
// 2. Create ExecutionResult (lightweight)
let result = ExecutionResult {
    id: execution_id,
    status: final_status,
    exit_code,
    stdout: stdout.clone(), // Full output (or summary if truncated)
    stderr: stderr.clone(),
    duration,
    started_at,
    completed_at: Some(Utc::now()),
    error: error_message,
};
// 3. Keep in memory for retention period
// (Cleanup task will remove after config.execution_retention_secs)
// 4. Return result to caller
Ok(result)
```
### What Caller (Tauri) Does
```rust
// In Tauri command handler
#[tauri::command]
async fn execute_command(
    request: ExecutionRequest,
    engine: State<'_, Arc<ExecutionEngine>>,
    storage: State<'_, Arc<StorageService>>, // Different crate!
) -> Result<String, String> {
    // The command returns `Result<String, String>`, so each error is
    // converted to a string before `?` propagates it.
    // 1. Execute command
    let execution_id = engine.execute(request).await.map_err(|e| e.to_string())?;
    // 2. Wait for completion
    let result = engine
        .wait_for_completion(execution_id)
        .await
        .map_err(|e| e.to_string())?;
    // 3. Store in Qdrant (Storage Service responsibility)
    storage
        .store_execution_result(&result)
        .await
        .map_err(|e| e.to_string())?;
    // 4. Return execution ID
    Ok(execution_id.to_string())
}
```
---
## Summary: What This Crate Needs
### ✅ Already Implemented
1. **In-memory state management** - Works
2. **Real-time streaming** - Works
3. **Event emission** - Works
4. **Log file writing** - Works (but can improve)
5. **Concurrent execution** - Works
### ❌ Missing (HIGH PRIORITY)
1. **Output size limiting** - Prevents OOM
   - Check size during streaming
   - Apply `OversizedOutputStrategy`
   - Truncate or fail gracefully
2. **Memory cleanup task** - Prevents memory leak
   - Remove old executions after retention period
   - Remove oldest if count exceeds limit
   - Run every 5 minutes
3. **Better log file management**
   - Write incrementally (not just at end)
   - Rotation if too large
   - Better error handling
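The two eviction rules for the cleanup task (age first, then count) can be expressed as one pure function that the periodic task calls. The sketch below is illustrative: it uses `u64` keys and `std::time::Instant` to stay dependency-free, whereas the engine keys executions by `Uuid`, and the name `cleanup` is hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Removes finished executions older than `retention`, then evicts the oldest
/// remaining entries until at most `max_entries` are left.
/// Returns the number of entries removed.
pub fn cleanup<V>(
    finished: &mut HashMap<u64, (Instant, V)>,
    now: Instant,
    retention: Duration,
    max_entries: usize,
) -> usize {
    let before = finished.len();

    // 1. Age-based eviction.
    finished.retain(|_, (completed_at, _)| {
        now.saturating_duration_since(*completed_at) <= retention
    });

    // 2. Count-based eviction, oldest first.
    if finished.len() > max_entries {
        let mut by_age: Vec<(u64, Instant)> =
            finished.iter().map(|(id, (t, _))| (*id, *t)).collect();
        by_age.sort_by_key(|&(_, t)| t);
        let excess = finished.len() - max_entries;
        for (id, _) in by_age.into_iter().take(excess) {
            finished.remove(&id);
        }
    }

    before - finished.len()
}
```

Passing `now` in as a parameter keeps the policy testable without sleeping; the 5-minute timer would only decide when the function gets called.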
### ❌ Not This Crate's Responsibility
1. **Qdrant storage** - Storage Service handles this
2. **Encryption** - Storage Service handles this
3. **S3 backup** - Storage Service handles this
4. **Query/retrieve** - Storage Service handles this
---
## Answering Your Question
> "but we need to store in file... for future reference or for sending it back to server"
### Answer:
**1. File Storage - YES** ✅
- We DO need log files
- For debugging and troubleshooting
- Already implemented (can improve)
- Location: `/tmp/cloudops-executions/{execution_id}.log`
**2. Sending Back to Server - NO** ❌
- We DO NOT send results to server
- Server is 100% stateless (privacy architecture)
- Everything stays on client device
**3. Success/Failure Reporting**
- Client displays in UI immediately
- Client stores in local Qdrant
- User can browse history locally
- Nothing transmitted to server
**4. Use Case Summary**
```
Server sends: "Execute this playbook"
Client executes: Locally with local credentials
Client displays: Results in UI
Client stores: In local Qdrant + log files
Client sends back: NOTHING
```
---
## Next Steps
### Priority 1: Add Output Size Limiting
```rust
// Add to executor.rs streaming functions
if state.stdout.len() > config.max_output_size_bytes {
    apply_oversized_strategy();
}
```
### Priority 2: Add Cleanup Task
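```rust
// Add to engine.rs
pub fn start_cleanup_task(self: &Arc<Self>) {
    // A spawned task must be 'static, so it cannot borrow `&self`;
    // clone the Arc and move the owned handle into the task instead.
    let engine = Arc::clone(self);
    tokio::spawn(async move {
        loop {
            tokio::time::sleep(Duration::from_secs(300)).await; // every 5 minutes
            engine.cleanup_old_executions().await;
        }
    });
}
```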
### Priority 3: Improve Log Files
```rust
// Write incrementally during execution
// Better structure: JSON or structured format
// Include metadata: started_at, command, etc.
```
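One way the three log-file goals above could fit together is sketched below: a writer that records metadata up front and appends each line as it arrives. This is a synchronous, std-only sketch with hypothetical names (`ExecutionLog`, `create`, `append`, `finish`) and a plain-text layout; the real crate would likely use `tokio::fs` and could choose JSON instead.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Appends output to a per-execution log file as it arrives.
pub struct ExecutionLog {
    writer: BufWriter<File>,
    path: PathBuf,
}

impl ExecutionLog {
    /// Creates `<dir>/<execution_id>.log` and writes a metadata header.
    pub fn create(
        dir: &Path,
        execution_id: &str,
        command: &str,
        started_at: &str,
    ) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        let path = dir.join(format!("{execution_id}.log"));
        let mut writer = BufWriter::new(File::create(&path)?);
        writeln!(writer, "execution_id: {execution_id}")?;
        writeln!(writer, "command: {command}")?;
        writeln!(writer, "started_at: {started_at}")?;
        writeln!(writer, "---")?;
        Ok(Self { writer, path })
    }

    /// Writes one line tagged with its stream ("stdout" or "stderr").
    pub fn append(&mut self, stream: &str, line: &str) -> io::Result<()> {
        writeln!(self.writer, "[{stream}] {line}")
    }

    /// Records the exit code, flushes buffered data, and returns the log path.
    pub fn finish(mut self, exit_code: i32) -> io::Result<PathBuf> {
        writeln!(self.writer, "--- exit_code: {exit_code}")?;
        self.writer.flush()?;
        Ok(self.path)
    }
}
```

Because lines reach the file during execution, a crash mid-run still leaves a partial log behind, which the current write-at-the-end approach cannot offer. The `BufWriter` does hold recent lines in memory, so a stricter variant would flush after each append.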
### Future: Integration with Storage Service
- Will be handled by different crate
- This crate just provides execution + log files
- Storage Service reads logs and stores in Qdrant