duroxide 0.1.27

Durable code execution framework for Rust
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
# Proposal: Automatic Execution Pruning

**Status:** Draft
**Author:** TBD
**Created:** 2024-01-04

## Summary

This proposal explores options for automatic execution pruning in long-running orchestrations that use `continue_as_new`. Currently, users must manually implement pruning via activities (see `sample_self_pruning_eternal_orchestration`). This proposal evaluates system-level alternatives.

## Problem Statement

Eternal orchestrations using `continue_as_new` accumulate execution history over time:

```
Execution 1 → ContinueAsNew → Execution 2 → ContinueAsNew → Execution 3 → ...
```

Each execution retains its full event history. For orchestrations running indefinitely (queue processors, schedulers, monitors), this leads to:

1. **Unbounded storage growth** - Each execution adds to total storage
2. **Slower queries** - `list_executions()` returns growing lists
3. **Manual cleanup burden** - Users must implement pruning logic

## Current Solution: Activity-Based Pruning

Users can implement self-pruning via an activity:

```rust
.register("PruneSelf", |ctx: ActivityContext, _: String| async move {
    let client = ctx.get_client();
    client.prune_executions(ctx.instance_id(), PruneOptions {
        keep_last: Some(1),
        ..Default::default()
    }).await.map_err(|e| e.to_string())?;
    Ok("pruned".to_string())
})

// In orchestration:
ctx.schedule_activity("PruneSelf", "").into_activity().await?;
ctx.continue_as_new(next_state).await
```

**Pros:**
- Explicit, visible in orchestration code
- Flexible timing and options
- No framework changes needed

**Cons:**
- Boilerplate in every eternal orchestration
- Easy to forget
- Activity overhead (scheduling, execution, history events)
- Must be called before `continue_as_new` (ordering matters)

---

## Proposed Alternatives

### Option A: System Activity (like `ctx.new_guid()`)

Add a built-in system operation that prunes without user-defined activities:

```rust
// Prune all but current execution
ctx.prune_history(PruneOptions { keep_last: Some(1), ..Default::default() }).await;
ctx.continue_as_new(next_state).await
```

**Implementation:**
- New `Action::PruneHistory { options }` variant
- Runtime handles pruning during orchestration dispatch
- Records `HistoryPruned { executions_deleted, events_deleted }` event

**Pros:**
- No activity registration required
- Deterministic (replays correctly)
- Lower overhead than activity (no queue, no worker dispatch)
- Explicit in orchestration code

**Cons:**
- New action type and event type
- Still requires user to remember to call it
- Adds complexity to replay engine

**User Personas:**
- *Power users*: Appreciate explicit control
- *New users*: May still forget to add it

---

### Option B: Parameter on `continue_as_new()`

Combine pruning with the continue operation:

```rust
// Option B1: Simple boolean
ctx.continue_as_new_with_prune(next_state, true).await

// Option B2: With options
ctx.continue_as_new_pruned(next_state, PruneOptions {
    keep_last: Some(3),  // Keep last 3 executions
    ..Default::default()
}).await

// Option B3: Builder pattern
ctx.continue_as_new(next_state)
    .prune_before(PruneOptions { keep_last: Some(1), ..Default::default() })
    .await
```

**Implementation:**
- Extend `Action::ContinueAsNew` with optional `prune_options: Option<PruneOptions>`
- Runtime prunes before starting new execution (atomic)

**Pros:**
- Natural pairing (prune and continue are logically linked)
- Single call instead of two
- Atomic operation (prune + continue together)
- Hard to get ordering wrong

**Cons:**
- API proliferation (`continue_as_new` vs `continue_as_new_pruned` vs `continue_as_new_with_options`)
- Less flexible if user wants to prune at other times
- Couples two concerns

**User Personas:**
- *New users*: Clear, hard to misuse
- *Power users*: May want pruning decoupled from continue

---

### Option C: Orchestration Registration Option

Declare pruning policy at orchestration definition:

```rust
OrchestrationRegistry::builder()
    .register_with_options(
        "EternalProcessor",
        orchestration_fn,
        OrchestrationOptions {
            auto_prune: Some(AutoPrunePolicy {
                keep_last: 1,
                on: PruneTrigger::ContinueAsNew,  // or ::EveryNExecutions(10)
            }),
            ..Default::default()
        }
    )
    .build()
```

**Implementation:**
- Store policy in orchestration metadata
- Runtime automatically prunes based on trigger condition
- No orchestration code changes needed

**Pros:**
- Zero orchestration code changes
- Policy defined once at registration
- Consistent across all instances of that orchestration
- "Set and forget"

**Cons:**
- Hidden behavior (not visible in orchestration code)
- Less flexible per-instance customization
- New registration API
- Policy might not fit all instances

**User Personas:**
- *Ops/Platform teams*: Love declarative policies
- *Developers*: May be surprised by implicit behavior
- *Debuggers*: "Where did my history go?"

---

### Option D: Runtime Configuration

Global or per-orchestration-type configuration:

```rust
// runtime_config.toml
[pruning]
enabled = true
default_keep_last = 5
triggers = ["continue_as_new"]

[pruning.overrides."EternalProcessor"]
keep_last = 1

// Or in code:
RuntimeOptions {
    auto_prune: AutoPruneConfig {
        default_policy: Some(PrunePolicy { keep_last: 5 }),
        overrides: hashmap! {
            "EternalProcessor" => PrunePolicy { keep_last: 1 },
        },
    },
    ..Default::default()
}
```

**Implementation:**
- Runtime checks config on each `continue_as_new`
- Applies policy based on orchestration name

**Pros:**
- No code changes to orchestrations
- Ops can tune without redeployment
- Centralized policy management
- Easy to enable/disable globally

**Cons:**
- Configuration complexity
- Disconnected from orchestration logic
- Hard to test (behavior depends on config)
- "Spooky action at a distance"

**User Personas:**
- *Ops teams*: Perfect for production tuning
- *Developers*: Frustrating when behavior differs between environments
- *Testers*: Need to replicate production config

---

## Comparison Matrix

| Criteria | Activity | System Op | continue_as_new param | Registration | Runtime Config |
|----------|----------|-----------|----------------------|--------------|----------------|
| Explicit in code ||||||
| Zero boilerplate ||| ⚠️ |||
| Flexible timing ||||||
| Hard to forget ||| ⚠️ |||
| Low overhead ||||||
| Testable |||| ⚠️ ||
| No API changes ||||||
| Ops-friendly |||| ⚠️ ||

---

## Recommendation

Implement in phases:

### Phase 1: System Activity (Option A)
Add `ctx.prune_history()` as a low-overhead system operation. This gives users explicit control with less boilerplate than the activity approach.

```rust
ctx.prune_history(PruneOptions { keep_last: Some(1), ..Default::default() }).await;
ctx.continue_as_new(state).await
```

### Phase 2: Registration Option (Option C)
For users who want "set and forget", add declarative policy at registration. This complements Phase 1 for different use cases.

```rust
.register_with_options("Eternal", fn, OrchestrationOptions {
    auto_prune_on_continue: Some(1),  // keep_last
    ..Default::default()
})
```

### Phase 3 (Maybe): Runtime Config (Option D)
If there's demand from ops teams, add runtime configuration as an override mechanism. This should be additive, not replacing the code-level options.

---

## Interaction with Versioning and Replay

### Why Pruning is Safe for Replay

Each execution is **self-contained** with its own complete history:

```
Execution 1: [OrchStarted] → [ActivityScheduled] → [ActivityCompleted] → [ContinuedAsNew]
Execution 2: [OrchStarted] → [TimerScheduled] → [TimerFired] → [ContinuedAsNew]
Execution 3: [OrchStarted] → [ActivityScheduled] → ... (current)
```

When replaying execution 3 after a crash:
- Runtime loads execution 3's history only
- Replays from its `OrchestrationStarted` event
- Executions 1 and 2 are **irrelevant** to replay

**Pruning executions 1 and 2 does not affect execution 3's replay correctness.**

### Version Transitions

Consider an orchestration upgrading across versions:

```
Exec 1 (v1.0): ProcessBatch → ContinueAsNew
Exec 2 (v1.0): ProcessBatch → ContinueAsNew
Exec 3 (v2.0): ProcessBatchV2 → ContinueAsNew  ← version upgrade here
Exec 4 (v2.0): ProcessBatchV2 → ...
```

**Scenario A: Prune during normal operation**
- Prune keeping last 2 executions (3 and 4)
- Execution 3 replays correctly using v2.0 code
- Execution 4 replays correctly using v2.0 code
- ✅ Safe

**Scenario B: Rollback to v1.0 after pruning**
- Executions 1 and 2 (v1.0) are gone
- Execution 3 and 4 have v2.0 history (different activity names/shapes)
- Replaying with v1.0 code → **NonDeterminismError**
- ❌ Cannot rollback

**Scenario C: Debugging v1.0 behavior**
- Bug reported: "v1.0 produced wrong output"
- Executions 1 and 2 are pruned
- Cannot inspect what v1.0 did
- ❌ Lost observability

### Recommendations for Version Transitions

1. **Increase retention during upgrades**
   ```rust
   // During version transition, keep more history
   ctx.prune_history(PruneOptions {
       keep_last: Some(10),  // instead of 1
       ..Default::default()
   }).await;
   ```

2. **Time-based retention for auditing**
   ```rust
   // Keep executions from last 7 days regardless of count
   ctx.prune_history(PruneOptions {
       keep_last: Some(1),
       completed_before: Some(now - 7.days()),  // AND semantics
       ..Default::default()
   }).await;
   ```

3. **Disable auto-prune during rollout**
   ```rust
   RuntimeOptions {
       auto_prune: AutoPruneConfig {
           enabled: !is_canary_deployment(),
           ..Default::default()
       },
   }
   ```

### What's Preserved vs Lost

| Data | After Pruning | Impact |
|------|---------------|--------|
| Current execution history | ✅ Preserved | Replay works |
| Current execution input | ✅ Preserved | State available |
| Previous execution outputs | ❌ Lost | Can't inspect old results |
| Previous version behavior | ❌ Lost | Can't debug old code |
| Sub-orchestration references | ⚠️ Partial | Sub-orch exists, but parent's spawn record gone |
| Event delivery records | ❌ Lost | Can't see events raised to old executions |

### Sub-Orchestrations and Pruning

Pruning parent executions does **not** cascade to sub-orchestrations:

```
Parent Exec 1: SpawnChild(C1) → ContinueAsNew
Parent Exec 2: SpawnChild(C2) → ContinueAsNew
Parent Exec 3: ... (current)

Child C1: [independent instance with own history]
Child C2: [independent instance with own history]
```

If we prune parent execution 1:
- Child C1 **still exists** (separate instance)
- Parent's `SubOrchestrationScheduled` event for C1 is **gone**
- We lose the audit trail of "who spawned C1"

### Unobserved Sub-Orchestration Completions

A more subtle issue arises with sub-orchestrations that are spawned but not awaited:

**Scenario 1: Select2 loser**
```
Parent Exec 1:
  - SubOrchScheduled(C1, event_id=2)
  - SubOrchScheduled(C2, event_id=3)
  - Select2 → C1 wins
  - SubOrchCompleted(C1)
  - ContinueAsNew            ← C2 still running!

Parent Exec 2:
  - Prune(keep_last=1)       ← Exec 1 deleted
  - ...working...

Child C2: completes, sends SubOrchCompleted to parent
```

**Scenario 2: Unawaited DurableFuture**
```rust
// Parent code - BUG: forgot to await child2
let child1 = ctx.schedule_sub_orchestration("Fast", "").into_sub_orchestration().await?;
let child2 = ctx.schedule_sub_orchestration("Slow", "");  // never awaited!
ctx.continue_as_new(state).await
```

```
Parent Exec 1:
  - SubOrchScheduled(C1)
  - SubOrchCompleted(C1)
  - SubOrchScheduled(C2)     ← scheduled but never awaited
  - ContinueAsNew

Child C2: running, will complete eventually
```

**What happens when orphaned child completes?**

1. Child C2 completes, runtime appends `SubOrchCompleted` to parent's history
2. Parent's current execution (Exec 2) replays
3. Exec 2's code never scheduled C2, so completion is **ignored** (execution_id filtering)
4. Same behavior with or without pruning

**Key insight:** Pruning doesn't create new problems here - the orphaned completion was already going to be ignored due to execution_id mismatch. The existing `continue_as_new` semantics handle this.

### The Real Problem: Resource Leaks

The issue isn't correctness, it's **cleanup**:

```
After pruning Exec 1:
- C2 instance still exists in database
- C2 has parent_instance_id = "parent"
- C2 might be: Running, Completed, or Failed
- Nobody is waiting for C2
- C2 is effectively orphaned
```

**Current state:**
| Child State | Parent Exec Pruned? | Outcome |
|-------------|---------------------|---------|
| Running | No | Select2 loser continues running |
| Running | Yes | Same - still running, orphaned |
| Completed | No | Completion ignored by parent |
| Completed | Yes | Same - completed, orphaned |
| Failed | No | Failure ignored by parent |
| Failed | Yes | Same - failed, orphaned |

Pruning doesn't change the orphaned child's fate - it was already orphaned by the select2 or unawaited future.

### Recommendations

1. **For select2 with sub-orchestrations: Cancel the loser**
   ```rust
   let (winner_idx, result) = ctx.select2(child1, child2).await;
   // Explicitly cancel the loser
   if winner_idx == 0 {
       // child2 lost - it gets cancelled automatically by select2
   }
   ```

   Note: `select2` already cancels the loser. But if cancellation doesn't propagate (child ignores it), the child continues.

2. **For fire-and-forget: Use detached orchestrations**
   ```rust
   // Use schedule_orchestration (detached) instead of schedule_sub_orchestration
   ctx.schedule_orchestration("Worker", "child-1", input);
   // Detached orchestrations are independent - no parent relationship
   ```

3. **Consider adding orphan cleanup (future feature)**
   ```rust
   // Management API to find and clean orphaned sub-orchestrations
   client.cleanup_orphaned_instances(OrphanFilter {
       parent_completed_before: Some(now - 7.days()),
       child_status: vec![Status::Completed, Status::Failed],
   }).await;
   ```

4. **Audit logging for spawn events**
   If you need to track "who spawned whom" after pruning, emit explicit trace events:
   ```rust
   ctx.trace_info(&format!("Spawning sub-orchestration: {}", child_id));
   let child = ctx.schedule_sub_orchestration("Worker", input);
   ```

### Summary: Pruning + Unobserved Completions

| Concern | Impact | Mitigation |
|---------|--------|------------|
| Correctness | ✅ None - execution_id filtering already handles this | N/A |
| Orphaned running children | ⚠️ Continue running | Cancel losers explicitly |
| Orphaned completed children | ⚠️ Sit in DB forever | Future: orphan cleanup API |
| Lost spawn audit trail | ⚠️ Can't trace lineage | Use explicit trace events |

**Recommendation:** For orchestrations with sub-orchestrations, consider keeping more history or using separate audit logging.

### Nondeterminism Detection

Nondeterminism is detected **within a single execution** during replay:

```
Execution 3 Replay:
  History: [OrchStarted, ActivityScheduled(id=2, name="Foo")]
  Code:    ctx.schedule_activity("Bar", ...)  // Different name!
  Result:  NonDeterminismError
```

Pruning old executions has **no effect** on nondeterminism detection because:
1. Detection compares code behavior vs current execution's history
2. Old executions' histories are never consulted during replay
3. Each execution is replayed independently

### Edge Case: Crash During Prune + ContinueAsNew

If using atomic prune-and-continue (Option B):

```
1. Orchestration calls continue_as_new_pruned(state, keep_last=1)
2. Runtime starts transaction:
   a. Delete old executions
   b. Create new execution
   c. CRASH before commit
3. On recovery: transaction rolled back
4. Old executions still exist, new execution not created
5. Orchestration replays, calls continue_as_new_pruned again
6. ✅ Idempotent, no data loss
```

If using separate prune-then-continue:

```
1. Orchestration calls prune_history(keep_last=1)
2. Prune completes, event recorded
3. Orchestration calls continue_as_new(state)
4. CRASH before continue completes
5. On recovery: prune event in history
6. Replay: prune is idempotent (already done), continue proceeds
7. ✅ Safe, but two operations instead of one
```

---

## Open Questions

1. **Should pruning be sync or async?** System activity could be fire-and-forget (don't wait for completion) to minimize latency.

2. **What if pruning fails?** Should it block `continue_as_new`? Probably not - pruning failure shouldn't break the orchestration.

3. **Metrics/observability?** Should emit metrics for pruned executions (e.g., `duroxide_executions_pruned_total`).

4. **Interaction with `completed_before`?** Should auto-prune support time-based retention, or just count-based?

5. **Per-instance override?** Can an instance opt-out of registration-level auto-prune?

---

## Appendix: Event Schema

If implementing Option A (System Activity):

```rust
enum EventKind {
    // ... existing variants ...

    /// History pruning was requested
    HistoryPruneRequested {
        options: PruneOptions,
    },

    /// History pruning completed
    HistoryPruned {
        executions_deleted: u64,
        events_deleted: u64,
    },
}

enum Action {
    // ... existing variants ...

    PruneHistory {
        scheduling_event_id: u64,
        options: PruneOptions,
    },
}
```