escher-execution-engine 0.1.2

Production-ready async execution engine for system commands
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
# Storage Strategy for Execution Results

This document explains the **actual use case** for execution result storage in the Escher architecture and what storage mechanisms are needed.

> **πŸ“¦ PLAYBOOK STRUCTURE (v2.0.0)**: As of v2.0.0, playbooks use a **single-file structure** (playbook.json) instead of 3 separate files. The playbook examples in this document show the simplified structure. The Execution Engine doesn't parse playbook files directly - it receives individual commands from the Execution Manager (TypeScript layer) which handles playbook orchestration. See [v2-architecture-docs]../../../v2-architecture-docs/working-docs/PLAYBOOK-STRUCTURE-MIGRATION-IMPACT.md for details.

---

## The Use Case: Server Playbook Execution

### Real-World Flow

```
1. User asks: "List all EC2 instances in production"
   └─ Typed in chat UI

2. Client β†’ Server (HTTPS request):
   β”œβ”€ Enriched query (NO credentials)
   β”œβ”€ Context: User has AWS "production" account
   └─ IAM permissions: ec2:DescribeInstances allowed

3. Server β†’ Client (HTTPS response):
   β”œβ”€ Playbook: Execute AWS CLI command
   └─ Ready-to-run script with parameters

4. Client executes LOCALLY:
   β”œβ”€ Retrieves credentials from OS Keychain
   β”œβ”€ Runs: aws ec2 describe-instances --profile production
   β”œβ”€ Captures stdout/stderr in real-time
   └─ Streams output to UI via Tauri events

5. Client reports result:
   β”œβ”€ If SUCCESS: Show results in UI βœ…
   └─ If FAILURE: Show error in UI + store for audit ❌

6. Client stores execution record:
   β”œβ”€ In local Qdrant database (encrypted)
   β”œβ”€ In temp log files (for debugging)
   └─ Backed up to user's S3 (NOT Escher servers)

7. NOTHING sent back to server about results
   └─ Server remains 100% stateless
```

---

## Critical Architecture Principle: Privacy-First

### ❌ What NEVER Goes to Server

1. **Credentials** - Stay in OS Keychain
2. **Execution output** - Stay on client device
3. **Error logs** - Stay on client device
4. **Cloud estate data** - Stay in local Qdrant
5. **Chat history** - Stay in local Qdrant (encrypted)

### βœ… What Does Go to Server

1. **Enriched query** (no secrets): "List EC2 in production account"
2. **Context metadata**: User has access to production, allowed operations
3. **IAM permission summary**: Can read EC2, cannot write
4. **Response**: Ready-to-execute script/playbook

**Server Response Example**:
```json
{
  "playbook_id": "uuid",
  "steps": [
    {
      "step_id": "1",
      "command": "aws ec2 describe-instances --profile {profile} --region {region}",
      "parameters": {
        "profile": "production",
        "region": "us-east-1"
      },
      "expected_duration_sec": 5,
      "retry_policy": "exponential_backoff"
    }
  ]
}
```

---

## Storage Requirements: What We Actually Need

### 1. In-Memory Storage (During Execution)

**Purpose**: Real-time streaming to UI

**Location**: `ExecutionState` in `ExecutionEngine`

```rust
pub struct ExecutionState {
    pub stdout: String,  // Accumulate for UI display
    pub stderr: String,  // Accumulate for error display
    // ... other fields
}
```

**Lifetime**: Until execution completes + retention period

**Use Case**:
- Stream output to UI in real-time
- Show progress to user
- Final display when complete

**Current Status**: βœ… Implemented correctly

---

### 2. Log Files (Temporary Debugging)

**Purpose**: Developer debugging, troubleshooting failed executions

**Location**: Temp folder `/tmp/cloudops-executions/{execution_id}.log`

**Content**: Complete stdout + stderr + metadata

**Lifetime**: OS cleans temp folder (7-30 days typical)

**Use Case**:
- User reports issue: "My deploy failed"
- Support can ask: "Send me the log file from /tmp/..."
- Developer can inspect full output

**Current Status**: βœ… Partially implemented (writes logs after completion)

---

### 3. Local Database (Persistent Audit Trail)

**Purpose**: Long-term audit, compliance, user history

**Location**: Local Qdrant vector database

**Collections**:

#### A. `chat_history` Collection
```rust
{
  "id": "uuid",
  "user_id": "user123",
  "timestamp": "2025-10-21T10:30:00Z",
  "query": "List EC2 instances in production",
  "response": "Found 15 instances...",
  "execution_ids": ["exec-uuid-1", "exec-uuid-2"],  // Links to executions
  "encrypted": true
}
```

#### B. `executed_operations` Collection
```rust
{
  "id": "execution-uuid",
  "conversation_id": "chat-uuid",  // Links back to chat
  "command": "aws ec2 describe-instances",
  "command_type": "aws_cli",
  "status": "completed",  // or "failed", "timeout", "cancelled"
  "started_at": "2025-10-21T10:30:05Z",
  "completed_at": "2025-10-21T10:30:12Z",
  "duration_sec": 7,
  "exit_code": 0,

  // Minimal output (NOT full logs)
  "output_summary": "15 instances found",  // First 500 chars
  "error_summary": null,  // First 500 chars if failed

  // Full output stored separately if needed
  "full_output_path": "/tmp/cloudops-executions/exec-uuid.log",

  "encrypted": true
}
```

**Retention**: User-configurable (default: 90 days)

**Encryption**: AES-256-GCM

**Use Case**:
- User browses execution history: "Show me all deploys from last week"
- Audit compliance: "Prove what operations were run on production"
- Cost tracking: "How many AWS API calls did I make?"

**Current Status**: ❌ Not implemented (requires Storage Service integration)

---

### 4. User's Cloud Storage (Backup)

**Purpose**: Long-term backup, disaster recovery

**Location**: User's own S3/Azure Blob/GCS bucket (NOT Escher servers)

**Content**: Encrypted Qdrant backup + execution logs

**Frequency**:
- Local: 7 days retention
- Cloud: 30 days retention

**Use Case**:
- User reinstalls app β†’ restore from S3
- User switches devices β†’ sync from S3
- Audit requirement: "Keep logs for 1 year"

**Current Status**: ❌ Not implemented (requires Storage Service integration)

---

## Reporting Back to Server: What Gets Sent?

### Answer: NOTHING ❌

**Server is 100% stateless**. It never receives:
- Execution results
- Success/failure status
- Output logs
- Error messages

### Why?

**Privacy Architecture**:
- Server cannot access user data
- Server cannot access cloud credentials
- Server cannot access execution logs
- All sensitive data stays on client device

**Client handles everything**:
- Execute β†’ Display in UI β†’ Store locally β†’ Done
- No callback to server required

---

## But What About Playbook Steps?

### Multi-Step Playbook Example

Server sends:
```json
{
  "playbook_id": "deploy-web-app",
  "steps": [
    {"step": 1, "command": "Build Docker image"},
    {"step": 2, "command": "Push to ECR"},
    {"step": 3, "command": "Update ECS service"},
    {"step": 4, "command": "Wait for deployment"},
    {"step": 5, "command": "Run health check"}
  ]
}
```

Client execution:
```
Step 1: Build Docker image
  β”œβ”€ stdout: Building... [2000 lines]
  β”œβ”€ Status: βœ… Success
  └─ Continue to Step 2

Step 2: Push to ECR
  β”œβ”€ stdout: Pushing... [500 lines]
  β”œβ”€ Status: βœ… Success
  └─ Continue to Step 3

Step 3: Update ECS service
  β”œβ”€ stdout: Updating... [200 lines]
  β”œβ”€ Status: ❌ Failed
  β”œβ”€ stderr: "Access denied: Missing ecs:UpdateService permission"
  └─ STOP - Report to UI

User sees in UI:
  βœ… Step 1: Build Docker image (12s)
  βœ… Step 2: Push to ECR (45s)
  ❌ Step 3: Update ECS service (2s)
      Error: Access denied - Missing ecs:UpdateService permission
  ⏸️  Step 4: Skipped
  ⏸️  Step 5: Skipped
```

**Stored locally**:
- Each step's execution record in `executed_operations`
- Full output in temp log files
- Linked to conversation in `chat_history`

**NOT sent to server**: None of this goes back to server

---

## Storage Strategy: What Execution Engine Needs

### Current Implementation (In-Memory Only)

```rust
// ExecutionEngine stores in memory
pub struct ExecutionEngine {
    executions: Arc<RwLock<HashMap<Uuid, Arc<RwLock<ExecutionState>>>>>,
    // ↑ In-memory only, lost on restart
}

// ExecutionState accumulates output
pub struct ExecutionState {
    pub stdout: String,  // Grows unbounded ⚠️
    pub stderr: String,  // Grows unbounded ⚠️
}
```

**Problems**:
1. ❌ No size limiting β†’ OOM risk
2. ❌ No persistence β†’ Lost on crash/restart
3. ❌ No cleanup β†’ Memory leak
4. βœ… Works for real-time streaming

---

### Required Implementation

#### Phase 1: Keep Current + Add Protections

**Immediate needs** (this crate):
1. βœ… In-memory storage for active executions (EXISTS)
2. βœ… Log file writing (EXISTS, but improve)
3. ❌ Output size limiting (NEEDED)
4. ❌ Memory cleanup task (NEEDED)

#### Phase 2: Integration with Storage Service

**Future integration** (different crate):
1. ❌ Write to local Qdrant after completion
2. ❌ Encrypt with AES-256-GCM
3. ❌ Backup to user's S3/Blob/GCS
4. ❌ Query/retrieve from Qdrant

**Not this crate's responsibility** - handled by `storage-service` crate

---

## What Execution Engine Should Do

### During Execution

```rust
// 1. Stream stdout/stderr line-by-line
while let Some(line) = stdout.next_line().await? {
    // Emit event (for UI)
    handler.handle_event(ExecutionEvent::Stdout {
        execution_id,
        line
    }).await;

    // Accumulate in memory (with size limit!)
    let mut state = state.write().await;
    if state.stdout.len() + line.len() > MAX_OUTPUT_SIZE {
        // Apply oversized strategy
        match strategy {
            TruncateWithWarning => {
                state.stdout.push_str("\n[Output truncated at 10MB]");
                break;
            }
            FailExecution => {
                return Err(ExecutionError::OutputSizeExceeded);
            }
            StreamToFile => {
                // Switch to file streaming
                write_to_file(&line).await?;
            }
        }
    }
    state.stdout.push_str(&line);
}
```

### After Completion

```rust
// 1. Write full log to temp file
let log_path = format!("/tmp/cloudops-executions/{}.log", execution_id);
tokio::fs::write(&log_path, format!(
    "Exit Code: {}\n\nSTDOUT:\n{}\n\nSTDERR:\n{}",
    exit_code, stdout, stderr
)).await?;

// 2. Create ExecutionResult (lightweight)
let result = ExecutionResult {
    id: execution_id,
    status: final_status,
    exit_code,
    stdout: stdout.clone(),  // Full output (or summary if truncated)
    stderr: stderr.clone(),
    duration,
    started_at,
    completed_at: Some(Utc::now()),
    error: error_message,
};

// 3. Keep in memory for retention period
// (Cleanup task will remove after config.execution_retention_secs)

// 4. Return result to caller
Ok(result)
```

### What Caller (Tauri) Does

```rust
// In Tauri command handler
#[tauri::command]
async fn execute_command(
    request: ExecutionRequest,
    engine: State<'_, Arc<ExecutionEngine>>,
    storage: State<'_, Arc<StorageService>>,  // Different crate!
) -> Result<String, String> {
    // 1. Execute command
    let execution_id = engine.execute(request).await?;

    // 2. Wait for completion
    let result = engine.wait_for_completion(execution_id).await?;

    // 3. Store in Qdrant (Storage Service responsibility)
    storage.store_execution_result(&result).await?;

    // 4. Return execution ID
    Ok(execution_id.to_string())
}
```

---

## Summary: What This Crate Needs

### βœ… Already Implemented

1. **In-memory state management** - Works
2. **Real-time streaming** - Works
3. **Event emission** - Works
4. **Log file writing** - Works (but can improve)
5. **Concurrent execution** - Works

### ❌ Missing (HIGH PRIORITY)

1. **Output size limiting** - Prevents OOM
   - Check size during streaming
   - Apply `OversizedOutputStrategy`
   - Truncate or fail gracefully

2. **Memory cleanup task** - Prevents memory leak
   - Remove old executions after retention period
   - Remove oldest if count exceeds limit
   - Run every 5 minutes

3. **Better log file management**
   - Write incrementally (not just at end)
   - Rotation if too large
   - Better error handling

### ❌ Not This Crate's Responsibility

1. **Qdrant storage** - Storage Service handles this
2. **Encryption** - Storage Service handles this
3. **S3 backup** - Storage Service handles this
4. **Query/retrieve** - Storage Service handles this

---

## Answering Your Question

> "but we need to store in file... for future reference or for sending it back to server"

### Answer:

**1. File Storage - YES** βœ…
- We DO need log files
- For debugging and troubleshooting
- Already implemented (can improve)
- Location: `/tmp/cloudops-executions/{execution_id}.log`

**2. Sending Back to Server - NO** ❌
- We DO NOT send results to server
- Server is 100% stateless (privacy architecture)
- Everything stays on client device

**3. Success/Failure Reporting**
- Client displays in UI immediately
- Client stores in local Qdrant
- User can browse history locally
- Nothing transmitted to server

**4. Use Case Summary**
```
Server sends: "Execute this playbook"
Client executes: Locally with local credentials
Client displays: Results in UI
Client stores: In local Qdrant + log files
Client sends back: NOTHING
```

---

## Next Steps

### Priority 1: Add Output Size Limiting
```rust
// Add to executor.rs streaming functions
if state.stdout.len() > config.max_output_size_bytes {
    apply_oversized_strategy();
}
```

### Priority 2: Add Cleanup Task
```rust
// Add to engine.rs
pub fn start_cleanup_task(&self) {
    tokio::spawn(async move {
        loop {
            tokio::time::sleep(Duration::from_secs(300)).await;
            self.cleanup_old_executions().await;
        }
    });
}
```

### Priority 3: Improve Log Files
```rust
// Write incrementally during execution
// Better structure: JSON or structured format
// Include metadata: started_at, command, etc.
```

### Future: Integration with Storage Service
- Will be handled by different crate
- This crate just provides execution + log files
- Storage Service reads logs and stores in Qdrant