duroxide 0.1.27

Durable code execution framework for Rust
# Management API: Deletion and Pruning

> **Status**: Implemented
> **Related**: `docs/management-api-improvements-proposal.md`

## 1. Summary

This proposal adds explicit deletion and retention capabilities to duroxide:

1. **Instance Deletion**: Remove terminated orchestrations and all associated data (`delete_instance`)
2. **Bulk Instance Deletion**: Delete multiple instances by ID and/or time-based filter (`delete_instance_bulk`)
3. **Execution Pruning**: Remove old executions from long-running `ContinueAsNew` chains (`prune_executions`)
4. **Bulk Execution Pruning**: Prune executions across multiple instances matching a filter (`prune_executions_bulk`)

## 2. Motivation

Currently, duroxide retains all history and execution data indefinitely. For long-running applications or high-volume systems, this leads to:

- **Unbounded Storage Growth**: "Eternal" orchestrations using `ContinueAsNew` accumulate history for every past execution.
- **Operational Clutter**: Terminated or failed test instances remain in the database.
- **Compliance/Privacy**: Inability to delete user-specific data (GDPR/CCPA "Right to be Forgotten").
- **No Retention Policy**: Operators cannot say "delete all completed workflows older than 30 days".

## 3. API Naming Philosophy

We use consistent terminology across the API:

| Term | Meaning | Scope |
|------|---------|-------|
| **`delete_instance`** | Permanently remove an orchestration instance and ALL its data | Single instance |
| **`delete_instance_bulk`** | Bulk delete multiple instances matching criteria | Multi-instance |
| **`prune_executions`** | Remove old executions while keeping the instance alive | Single instance |
| **`prune_executions_bulk`** | Prune executions across multiple instances | Multi-instance |

**Rationale:**
- `delete_instance` is the standard term for removing a single identified resource
- `delete_instance_bulk` uses the same verb with `_bulk` suffix for batch operations (consistent pattern)
- `prune` implies trimming while preserving the living entity (the instance continues to exist)
- Both single and bulk deletion return the same `DeleteInstanceResult` type for API consistency

---

## 4. Common Types

### 4.1 InstanceFilter

A reusable filter for selecting instances across management APIs.

```rust
/// Filter criteria for selecting orchestration instances.
/// 
/// Used by bulk delete, bulk prune, and potentially future management APIs.
/// When multiple criteria are provided, they are ANDed together.
#[derive(Debug, Clone, Default)]
pub struct InstanceFilter {
    /// Specific instance IDs to select.
    /// When provided with other filters, acts as an allowlist that is
    /// further filtered by the other criteria.
    pub instance_ids: Option<Vec<String>>,
    
    /// Only select instances whose current execution completed before this time.
    /// Value is milliseconds since Unix epoch.
    pub completed_before: Option<u64>,
    
    /// Maximum number of instances to select.
    /// Use for batching large operations.
    /// Default: 1000
    pub limit: Option<u32>,
}
```

**Filter Semantics (AND):**
- If `instance_ids` AND `completed_before` are both set, only instances that are in the ID list AND completed before the cutoff are selected.
- `limit` is always applied last, after other filters.

**Examples:**
```rust
// Select specific instances
InstanceFilter {
    instance_ids: Some(vec!["order-1".into(), "order-2".into()]),
    ..Default::default()
}

// Select by time (retention policy)
InstanceFilter {
    completed_before: Some(five_days_ago_ms),
    limit: Some(500),
    ..Default::default()
}

// Select specific instances that are also old (intersection)
InstanceFilter {
    instance_ids: Some(vec!["order-1".into(), "order-2".into(), "order-3".into()]),
    completed_before: Some(five_days_ago_ms), // Only delete if also old
    ..Default::default()
}
```
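
The AND semantics can be illustrated with a small in-memory sketch. `InstanceRow` and `select_instances` are hypothetical names introduced only for this illustration, and the struct below merely mirrors the `InstanceFilter` fields; the terminal-state check that the bulk APIs also perform is deliberately not modeled here.

```rust
/// Hypothetical in-memory view of an instance; not a duroxide type.
#[derive(Debug, Clone)]
struct InstanceRow {
    instance_id: String,
    /// Completion time of the current execution; `None` while still running.
    completed_at_ms: Option<u64>,
}

/// Mirror of the `InstanceFilter` fields described above.
#[derive(Debug, Clone, Default)]
struct InstanceFilter {
    instance_ids: Option<Vec<String>>,
    completed_before: Option<u64>,
    limit: Option<u32>,
}

/// Returns the IDs selected by `filter`: criteria are ANDed, `limit` is applied last.
fn select_instances(rows: &[InstanceRow], filter: &InstanceFilter) -> Vec<String> {
    // Documented default when no limit is given.
    let limit = filter.limit.unwrap_or(1000) as usize;
    rows.iter()
        // Allowlist: skipped entirely when `instance_ids` is None.
        .filter(|r| match &filter.instance_ids {
            Some(ids) => ids.iter().any(|id| id == &r.instance_id),
            None => true,
        })
        // Time cutoff: a row without a completion time never matches a cutoff.
        .filter(|r| match filter.completed_before {
            Some(cutoff) => r.completed_at_ms.map_or(false, |t| t < cutoff),
            None => true,
        })
        .take(limit)
        .map(|r| r.instance_id.clone())
        .collect()
}
```

Because `take` runs after both `filter` stages, a small `limit` never hides an older instance behind a newer one that fails the cutoff.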

### 4.2 PruneOptions

Options for pruning executions. When multiple criteria are provided, they are ANDed.

```rust
/// Options for pruning old executions.
/// 
/// When multiple criteria are provided, they are ANDed together.
/// The current (active) execution is NEVER pruned regardless of these options.
#[derive(Debug, Clone, Default)]
pub struct PruneOptions {
    /// Keep the last N executions (by execution_id).
    /// Executions outside the top N are eligible for deletion.
    pub keep_last: Option<u32>,
    
    /// Only delete executions completed before this time (milliseconds since epoch).
    pub completed_before: Option<u64>,
}
```

**Filter Semantics (AND):**
- If both `keep_last: Some(5)` and `completed_before: Some(ts)` are set, an execution is deleted only if it's outside the last 5 AND completed before the cutoff.
- The current execution (`current_execution_id`) is **NEVER** deleted regardless of options.
- An execution with `status = 'Running'` is **NEVER** pruned.
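
As a rough illustration of these rules, the sketch below selects prunable executions from an in-memory list. `ExecutionRow` and `executions_to_prune` are hypothetical names, not part of duroxide, and the struct only mirrors the `PruneOptions` fields. One assumption to note: when neither option is set, this sketch treats every non-current, non-running execution as eligible, which may not match what a real provider does.

```rust
/// Hypothetical in-memory view of one execution; not a duroxide type.
#[derive(Debug, Clone)]
struct ExecutionRow {
    execution_id: u64,
    /// "Running", "Completed", or "Failed".
    status: &'static str,
    completed_at_ms: Option<u64>,
}

/// Mirror of the `PruneOptions` fields described above.
#[derive(Debug, Clone, Default)]
struct PruneOptions {
    keep_last: Option<u32>,
    completed_before: Option<u64>,
}

/// Returns the execution IDs eligible for pruning, in ascending order.
fn executions_to_prune(
    executions: &[ExecutionRow],
    current_execution_id: u64,
    options: &PruneOptions,
) -> Vec<u64> {
    // IDs of the newest `keep_last` executions (highest execution_id first).
    let mut ids: Vec<u64> = executions.iter().map(|e| e.execution_id).collect();
    ids.sort_unstable_by(|a, b| b.cmp(a));
    let kept: Vec<u64> = match options.keep_last {
        Some(n) => ids.into_iter().take(n as usize).collect(),
        None => Vec::new(),
    };

    let mut out: Vec<u64> = executions
        .iter()
        // Hard guards: never the current execution, never a running one.
        .filter(|e| e.execution_id != current_execution_id)
        .filter(|e| e.status != "Running")
        // Optional criteria, ANDed together.
        .filter(|e| !kept.contains(&e.execution_id))
        .filter(|e| match options.completed_before {
            Some(cutoff) => e.completed_at_ms.map_or(false, |t| t < cutoff),
            None => true,
        })
        .map(|e| e.execution_id)
        .collect();
    out.sort_unstable();
    out
}
```

With six executions where the sixth is current, `keep_last: Some(2)` protects executions 5 and 6, so adding a `completed_before` cutoff can only shrink the set of deleted executions, never grow it.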

### 4.3 Result Types

```rust
/// Result of instance deletion (single or bulk).
///
/// Used by both `delete_instance` and `delete_instance_bulk` for API consistency.
#[derive(Debug, Clone, Default)]
pub struct DeleteInstanceResult {
    /// Number of instances deleted (1 for single instance, N for bulk).
    pub instances_deleted: u64,
    /// Number of executions deleted.
    pub executions_deleted: u64,
    /// Number of history events deleted.
    pub events_deleted: u64,
    /// Number of queue messages deleted (orchestrator + worker + timer queues).
    pub queue_messages_deleted: u64,
}

/// Result of an execution prune operation.
#[derive(Debug, Clone, Default)]
pub struct PruneResult {
    /// Number of instances processed (1 for single instance prune, N for bulk).
    pub instances_processed: u64,
    /// Number of executions deleted.
    pub executions_deleted: u64,
    /// Number of history events deleted.
    pub events_deleted: u64,
}
```

**Design Decision:** We use a single `DeleteInstanceResult` type for both single and bulk deletion operations. The `instances_deleted` field is `u64` rather than `bool` to support both cases uniformly. For single-instance deletion, this is `1` on success; a missing instance is reported as `ClientError::InstanceNotFound` (see Section 5.1) rather than as a zero count.

---

## 5. User-Facing API (Client)

These methods are exposed on the `Client` struct for end-user consumption.

### 5.1 Delete Single Instance

```rust
impl Client {
    /// Delete an orchestration instance and all its data.
    ///
    /// Removes the instance, all executions, all history events, and any
    /// pending queue messages for this instance.
    ///
    /// # Parameters
    /// * `instance_id` - The ID of the instance to delete
    /// * `force` - If true, delete even if instance has status='Running'
    ///
    /// # The `force` Parameter
    /// 
    /// When `force=false` (default):
    /// - Only deletes instances in terminal states (Completed, Failed)
    /// - Returns `Err(ClientError::InstanceStillRunning)` if status is Running
    ///
    /// When `force=true`:
    /// - Deletes the instance regardless of status
    /// - **Only affects database state** — does NOT kill in-flight tokio tasks
    /// - Use for instances that are "waiting" (between turns) but need removal
    /// - If an orchestration turn is actively executing, it will fail when
    ///   trying to persist state (instance gone)
    /// - If activities are running, they will detect deletion via lock renewal
    ///   failure and can terminate gracefully
    ///
    /// **Recommended pattern:** Use `cancel_instance()` first for graceful
    /// shutdown. Only use `force=true` for cleanup of instances stuck in
    /// Running state that won't respond to cancellation.
    ///
    /// # Returns
    /// * `Ok(DeleteInstanceResult)` - Details of what was deleted (`instances_deleted` will be 1)
    /// * `Err(ClientError::InstanceStillRunning)` - Instance is running and force=false
    /// * `Err(ClientError::InstanceNotFound)` - Instance doesn't exist
    ///
    /// # Example
    /// ```ignore
    /// // Delete a completed instance
    /// let result = client.delete_instance("order-123", false).await?;
    /// println!("Deleted {} events", result.events_deleted);
    ///
    /// // Graceful pattern: cancel first, then delete
    /// client.cancel_instance("workflow-456").await?;
    /// // Wait for cancellation to complete...
    /// client.delete_instance("workflow-456", false).await?;
    ///
    /// // Force delete an instance stuck in Running state
    /// // (e.g., waiting on event that will never arrive, cancel didn't help)
    /// client.delete_instance("stuck-workflow", true).await?;
    /// ```
    pub async fn delete_instance(
        &self,
        instance_id: &str,
        force: bool,
    ) -> Result<DeleteInstanceResult, ClientError>;
}
```
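
The documented outcomes of `delete_instance` reduce to a small decision table over the instance status and the `force` flag. The sketch below encodes that table; `Status`, `DeleteCheck`, and `check_delete` are hypothetical names used only for illustration, not duroxide types.

```rust
/// Hypothetical status of an instance's current execution.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Running,
    Completed,
    Failed,
}

/// Outcome of the pre-delete check, mirroring the documented return values.
#[derive(Debug, PartialEq, Eq)]
enum DeleteCheck {
    Proceed,
    InstanceStillRunning,
    InstanceNotFound,
}

/// `status` is `None` when the instance does not exist.
fn check_delete(status: Option<Status>, force: bool) -> DeleteCheck {
    match (status, force) {
        (None, _) => DeleteCheck::InstanceNotFound,
        // A running instance is only deletable when `force` is set.
        (Some(Status::Running), false) => DeleteCheck::InstanceStillRunning,
        (Some(_), _) => DeleteCheck::Proceed,
    }
}
```

Note that `force` never turns a missing instance into a success; it only bypasses the running-state guard.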

### 5.2 Bulk Delete Instances

```rust
impl Client {
    /// Delete multiple orchestration instances matching the filter criteria.
    ///
    /// This is the primary API for retention-based cleanup. Only instances
    /// in terminal states (Completed, Failed) are eligible for deletion.
    /// Running instances are always skipped (not an error).
    ///
    /// # Filter Behavior
    ///
    /// All filter criteria are ANDed together:
    /// - `instance_ids` + `completed_before`: Only delete IDs that are also old
    /// - `limit` is applied after other filters
    ///
    /// # Examples
    /// ```ignore
    /// // Delete specific instances
    /// let result = client.delete_instance_bulk(InstanceFilter {
    ///     instance_ids: Some(vec!["order-1".into(), "order-2".into()]),
    ///     ..Default::default()
    /// }).await?;
    ///
    /// // Delete by age (retention policy)
    /// let five_days_ago = now_ms - (5 * 24 * 60 * 60 * 1000);
    /// let result = client.delete_instance_bulk(InstanceFilter {
    ///     completed_before: Some(five_days_ago),
    ///     limit: Some(500),
    ///     ..Default::default()
    /// }).await?;
    ///
    /// // Delete specific instances only if they're old
    /// let result = client.delete_instance_bulk(InstanceFilter {
    ///     instance_ids: Some(vec!["order-1".into(), "order-2".into()]),
    ///     completed_before: Some(five_days_ago),
    ///     ..Default::default()
    /// }).await?;
    /// ```
    ///
    /// # Safety
    /// - Running instances are NEVER deleted (silently skipped)
    /// - Use `limit` to avoid long-running transactions
    pub async fn delete_instance_bulk(
        &self,
        filter: InstanceFilter,
    ) -> Result<DeleteInstanceResult, ClientError>;
}
```
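
One way to use `limit` for large cleanups is to delete in batches until a batch comes back short. The sketch below shows that loop with a synchronous closure standing in for `client.delete_instance_bulk(...)`; `delete_in_batches` is a hypothetical helper, the struct mirrors only two of the result fields, and the stopping rule (a batch smaller than `limit` means nothing is left) is an assumption rather than documented behavior.

```rust
/// Subset of the `DeleteInstanceResult` fields, mirrored for this sketch.
#[derive(Debug, Clone, Default, PartialEq)]
struct DeleteInstanceResult {
    instances_deleted: u64,
    events_deleted: u64,
}

/// Repeatedly invokes `delete_batch(limit)` until a batch deletes fewer
/// instances than `limit`, summing the per-batch counts.
///
/// `delete_batch` is a hypothetical synchronous stand-in for a call such as
/// `client.delete_instance_bulk(InstanceFilter { limit: Some(limit), .. })`.
fn delete_in_batches<F>(limit: u32, mut delete_batch: F) -> DeleteInstanceResult
where
    F: FnMut(u32) -> DeleteInstanceResult,
{
    let mut total = DeleteInstanceResult::default();
    if limit == 0 {
        // A zero limit could never make progress; return the empty total.
        return total;
    }
    loop {
        let batch = delete_batch(limit);
        total.instances_deleted += batch.instances_deleted;
        total.events_deleted += batch.events_deleted;
        // A short batch means no further candidates matched the filter.
        if batch.instances_deleted < u64::from(limit) {
            return total;
        }
    }
}
```

Each call maps to one bounded transaction, so the total work is unbounded but no single transaction is.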

### 5.3 Prune Executions (Single Instance)

```rust
impl Client {
    /// Prune old executions from a single long-running instance.
    ///
    /// Use this for `ContinueAsNew` workflows that accumulate many executions
    /// over time. The current (active) execution is never pruned.
    ///
    /// # Filter Behavior
    /// 
    /// When both options are set, they are ANDed:
    /// - `keep_last: 5` + `completed_before: ts` = delete executions outside 
    ///   the last 5 that are ALSO older than the cutoff
    ///
    /// # Examples
    /// ```ignore
    /// // Keep only the last 10 executions
    /// let result = client.prune_executions("eternal-workflow", PruneOptions {
    ///     keep_last: Some(10),
    ///     ..Default::default()
    /// }).await?;
    ///
    /// // Delete executions older than 30 days
    /// let thirty_days_ago = now_ms - (30 * 24 * 60 * 60 * 1000);
    /// let result = client.prune_executions("eternal-workflow", PruneOptions {
    ///     completed_before: Some(thirty_days_ago),
    ///     ..Default::default()
    /// }).await?;
    ///
    /// // Combined: delete old executions, but always keep at least 5
    /// let result = client.prune_executions("eternal-workflow", PruneOptions {
    ///     keep_last: Some(5),
    ///     completed_before: Some(thirty_days_ago),
    ///     ..Default::default()
    /// }).await?;
    /// ```
    pub async fn prune_executions(
        &self,
        instance_id: &str,
        options: PruneOptions,
    ) -> Result<PruneResult, ClientError>;
}
```

### 5.4 Bulk Prune Executions

```rust
impl Client {
    /// Prune old executions from multiple instances matching the filter.
    ///
    /// Applies the same prune options to all matching instances. Useful for
    /// system-wide retention policies on long-running workflows.
    ///
    /// # Examples
    /// ```ignore
    /// // Prune all terminal instances: keep last 10 executions each
    /// let result = client.prune_executions_bulk(
    ///     InstanceFilter {
    ///         completed_before: None,  // All terminal instances
    ///         limit: Some(100),
    ///         ..Default::default()
    ///     },
    ///     PruneOptions {
    ///         keep_last: Some(10),
    ///         ..Default::default()
    ///     },
    /// ).await?;
    ///
    /// // Prune specific instances: delete executions older than 30 days
    /// let result = client.prune_executions_bulk(
    ///     InstanceFilter {
    ///         instance_ids: Some(vec!["workflow-a".into(), "workflow-b".into()]),
    ///         ..Default::default()
    ///     },
    ///     PruneOptions {
    ///         completed_before: Some(thirty_days_ago),
    ///         ..Default::default()
    ///     },
    /// ).await?;
    /// println!("Pruned {} executions across {} instances", 
    ///     result.executions_deleted, result.instances_processed);
    /// ```
    ///
    /// # Notes
    /// - Only processes instances in terminal states (Completed, Failed)
    /// - Running instances are skipped
    /// - The current execution of each instance is never pruned
    /// - Active (Running) executions are never pruned
    pub async fn prune_executions_bulk(
        &self,
        filter: InstanceFilter,
        options: PruneOptions,
    ) -> Result<PruneResult, ClientError>;
}
```

---

## 6. Provider API (ProviderAdmin)

The `ProviderAdmin` trait is the internal interface that storage backends implement.
The `Client` methods delegate to these after validation.

### 6.1 Provider Trait Extensions

```rust
#[async_trait::async_trait]
pub trait ProviderAdmin: Any + Send + Sync {
    // ... existing methods ...

    // ===== Deletion Operations =====

    /// Delete a single instance and all associated data.
    ///
    /// # Implementation Requirements
    /// - Delete from: history, executions, instances tables
    /// - Delete from: orchestrator_queue, worker_queue (by instance_id)
    /// - Delete from: instance_locks
    /// - Must be transactional (all-or-nothing)
    /// - Return error if status is 'Running' and force=false
    async fn delete_instance(
        &self,
        instance_id: &str,
        force: bool,
    ) -> Result<DeleteInstanceResult, ProviderError>;

    /// Delete multiple instances matching the filter.
    ///
    /// # Implementation Requirements
    /// - AND all filter criteria together
    /// - Skip any instance with status='Running' (don't error)
    /// - Should be efficient (batch SQL operations)
    /// - Apply limit after other filters
    async fn delete_instance_bulk(
        &self,
        filter: InstanceFilter,
    ) -> Result<DeleteInstanceResult, ProviderError>;

    /// Prune old executions from a single instance.
    ///
    /// # Implementation Requirements  
    /// - NEVER delete the current_execution_id
    /// - NEVER delete executions with status='Running'
    /// - AND the options together (both must match for deletion)
    /// - When keep_last is set: executions outside top N are eligible
    /// - When completed_before is set: executions older than cutoff are eligible
    async fn prune_executions(
        &self,
        instance_id: &str,
        options: &PruneOptions,
    ) -> Result<PruneResult, ProviderError>;

    /// Prune old executions from multiple instances matching the filter.
    ///
    /// # Implementation Requirements
    /// - Select instances using InstanceFilter (AND semantics)
    /// - Apply PruneOptions to each selected instance (AND semantics)
    /// - Skip running instances
    /// - Never delete current_execution_id of any instance
    /// - Never delete executions with status='Running'
    async fn prune_executions_bulk(
        &self,
        filter: &InstanceFilter,
        options: &PruneOptions,
    ) -> Result<PruneResult, ProviderError>;
}
```

### 6.2 Client → Provider Mapping

| Client Method | Provider Method |
|---------------|-----------------|
| `delete_instance(id, force)` | `delete_instance(id, force)` |
| `delete_instance_bulk(filter)` | `delete_instance_bulk(filter)` |
| `prune_executions(id, options)` | `prune_executions(id, &options)` |
| `prune_executions_bulk(filter, options)` | `prune_executions_bulk(&filter, &options)` |

### 6.3 Provider Simplification: Primitives vs Composites

To reduce the implementation burden on provider developers, we separate the `ProviderAdmin` trait into:

1. **Primitives** - Simple database operations providers MUST implement
2. **Composites** - Complex operations with DEFAULT implementations using primitives

This means provider developers only implement 3-4 simple methods, and get all cascade/tree logic for free.

#### Primitive Methods (Provider Implements)

```rust
#[async_trait::async_trait]
pub trait ProviderAdmin: Any + Send + Sync {
    // ... existing read-only methods ...

    // ===== Primitive Hierarchy Operations =====

    /// List direct children of an instance.
    ///
    /// Returns instance IDs that have `parent_instance_id = instance_id`.
    /// Returns empty vec if instance has no children or doesn't exist.
    async fn list_children(&self, instance_id: &str) -> Result<Vec<String>, ProviderError>;

    /// Get the parent instance ID.
    ///
    /// Returns `Some(parent_id)` for sub-orchestrations, `None` for root orchestrations.
    /// Returns `Err` if instance doesn't exist.
    async fn get_parent_id(&self, instance_id: &str) -> Result<Option<String>, ProviderError>;

    /// Atomically delete a batch of instances.
    ///
    /// # Orphan Validation
    /// This method MUST validate that no orphans would be created:
    /// - If any instance in `ids` has children that are NOT in `ids`, return error
    /// - If any instance in `ids` has a parent that IS in `ids`, the parent must appear
    ///   AFTER the child (children deleted before parents)
    ///
    /// # Transaction Semantics
    /// All instances must be deleted atomically (all-or-nothing).
    ///
    /// # Status Check
    /// If `force=false`, all instances must be in terminal state.
    /// If `force=true`, delete regardless of status.
    async fn delete_instances_atomic(
        &self,
        ids: &[String],
        force: bool,
    ) -> Result<DeleteInstanceResult, ProviderError>;
}
```
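
The two orphan rules for `delete_instances_atomic` can be checked before any row is touched. The following sketch validates a batch against an in-memory child-to-parent map; `validate_batch` and `parent_of` are hypothetical names standing in for whatever query a provider would run against `parent_instance_id`.

```rust
use std::collections::{HashMap, HashSet};

/// Checks the two orphan rules for a deletion batch.
///
/// `parent_of` maps child instance ID -> parent instance ID (a hypothetical
/// in-memory stand-in for the `parent_instance_id` column).
fn validate_batch(ids: &[String], parent_of: &HashMap<String, String>) -> Result<(), String> {
    let in_batch: HashSet<&str> = ids.iter().map(String::as_str).collect();

    // Rule 1: every child of a batch member must also be in the batch.
    for (child, parent) in parent_of {
        if in_batch.contains(parent.as_str()) && !in_batch.contains(child.as_str()) {
            return Err(format!("deleting {parent} would orphan {child}"));
        }
    }

    // Rule 2: a parent in the batch must appear after each of its children.
    let position: HashMap<&str, usize> = ids
        .iter()
        .enumerate()
        .map(|(i, id)| (id.as_str(), i))
        .collect();
    for (child, parent) in parent_of {
        if let (Some(c), Some(p)) = (position.get(child.as_str()), position.get(parent.as_str())) {
            if p < c {
                return Err(format!("parent {parent} listed before child {child}"));
            }
        }
    }
    Ok(())
}
```

A real provider would run the equivalent checks inside the same transaction as the deletes, so that a child created concurrently cannot slip past validation.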

#### Composite Methods (Default Implementations)

```rust
#[async_trait::async_trait]
pub trait ProviderAdmin: Any + Send + Sync {
    // ... primitives above ...

    // ===== Composite Operations (default implementations) =====

    /// Get the full instance tree rooted at the given instance.
    ///
    /// Returns all instances in the tree: the root, all children, grandchildren, etc.
    /// Ordered for safe deletion: children before parents (depth-first, leaves first).
    ///
    /// Default implementation uses `list_children` recursively.
    async fn get_instance_tree(&self, instance_id: &str) -> Result<InstanceTree, ProviderError> {
        // Default implementation using list_children
        let mut tree = InstanceTree {
            root_id: instance_id.to_string(),
            all_ids: vec![],
        };

        // Stack-based (depth-first) traversal to collect all descendants
        let mut to_process = vec![instance_id.to_string()];
        while let Some(parent_id) = to_process.pop() {
            tree.all_ids.push(parent_id.clone());
            let children = self.list_children(&parent_id).await?;
            to_process.extend(children);
        }

        // Each parent was recorded before its descendants, so reversing
        // yields a safe deletion order (children before parents)
        tree.all_ids.reverse();
        Ok(tree)
    }

    /// Delete a single instance (and all descendants if root).
    ///
    /// Default implementation:
    /// 1. Check if instance has a parent (is sub-orchestration)
    /// 2. If sub-orchestration: return error (must delete root)
    /// 3. Get full tree via `get_instance_tree`
    /// 4. Call `delete_instances_atomic` with all IDs
    async fn delete_instance(
        &self,
        instance_id: &str,
        force: bool,
    ) -> Result<DeleteInstanceResult, ProviderError> {
        // Step 1: Check if this is a sub-orchestration
        let parent = self.get_parent_id(instance_id).await?;
        if parent.is_some() {
            return Err(ProviderError::permanent(
                "delete_instance",
                format!(
                    "Cannot delete sub-orchestration {} directly. Delete root instance instead.",
                    instance_id
                ),
            ));
        }

        // Step 2: Get full tree
        let tree = self.get_instance_tree(instance_id).await?;

        // Step 3: Atomic delete (tree.all_ids is already in deletion order)
        self.delete_instances_atomic(&tree.all_ids, force).await
    }

    /// Delete multiple instances matching filter.
    ///
    /// Default implementation iterates through matches and calls delete_instance.
    /// Provider can override for better performance (batch operations).
    async fn delete_instance_bulk(&self, filter: InstanceFilter) -> Result<DeleteInstanceResult, ProviderError>;
}
```
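
The ordering guarantee of `get_instance_tree` can be checked in isolation with a synchronous, in-memory version of the same traversal. `deletion_order` and `children_of` are hypothetical names; the map stands in for repeated `list_children` calls.

```rust
use std::collections::HashMap;

/// Collects `root` and all descendants, ordered children-before-parents.
///
/// `children_of` is a hypothetical in-memory stand-in for `list_children`.
fn deletion_order(root: &str, children_of: &HashMap<String, Vec<String>>) -> Vec<String> {
    let mut all_ids = Vec::new();
    let mut to_process = vec![root.to_string()];
    // An ID is recorded in the same iteration that schedules its children,
    // so a parent always precedes its descendants in `all_ids`.
    while let Some(id) = to_process.pop() {
        if let Some(children) = children_of.get(&id) {
            to_process.extend(children.iter().cloned());
        }
        all_ids.push(id);
    }
    // Reversing flips that invariant: every descendant now precedes its parent.
    all_ids.reverse();
    all_ids
}
```

The result is not strictly "all leaves first", but it does satisfy the property deletion needs: no instance appears before any of its descendants.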

#### New Type: InstanceTree

```rust
/// Represents an instance and all its descendants.
///
/// Used for inspecting hierarchies before deletion, or for understanding
/// sub-orchestration relationships.
#[derive(Debug, Clone)]
pub struct InstanceTree {
    /// The root instance ID.
    pub root_id: String,

    /// All instance IDs in the tree (including root).
    /// Ordered for safe deletion: children before parents.
    pub all_ids: Vec<String>,
}

impl InstanceTree {
    /// Returns true if this tree contains only the root (no children/descendants).
    pub fn is_root_only(&self) -> bool {
        self.all_ids.len() == 1
    }

    /// Returns the number of instances in the tree.
    pub fn size(&self) -> usize {
        self.all_ids.len()
    }
}
```

#### Provider Implementation Burden

| Before | After |
|--------|-------|
| Implement `delete_instance` with full cascade logic | Implement `list_children` (1 query) |
| Implement `delete_instance_bulk` with tree traversal | Implement `get_parent_id` (1 query) |
| Handle orphan validation | Implement `delete_instances_atomic` (batch delete) |
| ~200 lines per provider | ~50 lines per provider |

#### Example SQLite Primitive Implementation

```rust
use sqlx::Row; // brings `Row::get` into scope

#[async_trait::async_trait]
impl ProviderAdmin for SqliteProvider {
    async fn list_children(&self, instance_id: &str) -> Result<Vec<String>, ProviderError> {
        let rows = sqlx::query("SELECT instance_id FROM instances WHERE parent_instance_id = ?")
            .bind(instance_id)
            .fetch_all(&self.pool)
            .await?;
        Ok(rows.iter().map(|r| r.get("instance_id")).collect())
    }

    async fn get_parent_id(&self, instance_id: &str) -> Result<Option<String>, ProviderError> {
        let row = sqlx::query("SELECT parent_instance_id FROM instances WHERE instance_id = ?")
            .bind(instance_id)
            .fetch_optional(&self.pool)
            .await?;
        match row {
            Some(r) => Ok(r.get("parent_instance_id")),
            None => Err(ProviderError::permanent("get_parent_id", "Instance not found")),
        }
    }

    async fn delete_instances_atomic(
        &self,
        ids: &[String],
        force: bool,
    ) -> Result<DeleteInstanceResult, ProviderError> {
        let mut tx = self.pool.begin().await?;

        // Status check (if not force)
        if !force {
            // Check all instances are terminal
            // ...
        }

        // Delete all instances
        let mut result = DeleteInstanceResult::default();
        for id in ids {
            // Delete from all tables
            // Aggregate counts into result
        }

        tx.commit().await?;
        Ok(result)
    }

    // delete_instance: uses default implementation!
    // get_instance_tree: uses default implementation!
}
```

#### Public API: get_instance_tree

The `get_instance_tree` method is also exposed via the `Client` API for users who want to inspect the hierarchy before deletion:

```rust
impl Client {
    /// Get the full instance tree rooted at the given instance.
    ///
    /// Useful for inspecting hierarchy before deletion, or for
    /// understanding sub-orchestration relationships.
    ///
    /// # Returns
    /// * `Ok(InstanceTree)` - The tree with all descendant IDs
    /// * `Err(ClientError::InstanceNotFound)` - Instance doesn't exist
    ///
    /// # Example
    /// ```ignore
    /// let tree = client.get_instance_tree("order-123").await?;
    /// println!("Will delete {} instances", tree.size());
    /// for id in &tree.all_ids {
    ///     println!("  - {}", id);
    /// }
    /// client.delete_instance("order-123", false).await?;
    /// ```
    pub async fn get_instance_tree(&self, instance_id: &str) -> Result<InstanceTree, ClientError>;
}
```

---

## 7. Detailed Design

### 7.1 Schema Requirements

The following columns are required for time-based operations:

| Table | Column | Type | Purpose |
|-------|--------|------|---------|
| `executions` | `completed_at` | INTEGER (epoch ms) | Anchor for retention queries |
| `executions` | `status` | TEXT | Filter terminal vs running |
| `instances` | `current_execution_id` | INTEGER | Identify the active execution |

### 7.2 Timestamp Storage

All timestamps MUST be stored as **INTEGER milliseconds since Unix epoch**, computed in Rust from `SystemTime::now()`, NOT via SQLite's `CURRENT_TIMESTAMP` (which produces TEXT).

**Rationale:**
- Consistent with existing `visible_at`, `locked_until` columns
- Enables simple numeric comparisons: `completed_at < ?`
- Avoids SQLite datetime parsing complexity
- Matches the `u64` epoch milliseconds used in the management API

**Implementation:** Use `Self::now_millis()` helper (already exists in SqliteProvider).
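
For reference, converting a `SystemTime` to epoch milliseconds and deriving a retention cutoff needs only the standard library. The helpers below are a sketch with hypothetical names (`to_epoch_millis`, `retention_cutoff_ms`); they are not the provider's `now_millis()` helper, whose exact body is not shown in this document.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Milliseconds since the Unix epoch for `t` (0 if `t` predates the epoch).
fn to_epoch_millis(t: SystemTime) -> u64 {
    t.duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as u64)
        .unwrap_or(0)
}

/// Cutoff for "completed more than `retention` ago", suitable for a
/// `completed_before` field. Saturates at 0 rather than underflowing.
fn retention_cutoff_ms(now: SystemTime, retention: Duration) -> u64 {
    to_epoch_millis(now).saturating_sub(retention.as_millis() as u64)
}
```

Taking `now` as a parameter instead of calling `SystemTime::now()` inside the helper keeps the arithmetic testable with fixed timestamps; callers would pass `SystemTime::now()` in production code.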

### 7.3 SQL Implementation Sketches

**Delete Single Instance:**
```sql
BEGIN TRANSACTION;

-- Safety check (unless force=true)
SELECT e.status FROM instances i
JOIN executions e ON i.instance_id = e.instance_id 
    AND i.current_execution_id = e.execution_id
WHERE i.instance_id = ?;
-- If status = 'Running' AND NOT force: ROLLBACK and return error

-- Delete all related data
DELETE FROM history WHERE instance_id = ?;
DELETE FROM executions WHERE instance_id = ?;
DELETE FROM orchestrator_queue WHERE instance_id = ?;
DELETE FROM worker_queue WHERE instance_id = ?;
DELETE FROM instance_locks WHERE instance_id = ?;
DELETE FROM instances WHERE instance_id = ?;

COMMIT;
```

**Bulk Delete Instances (with ANDed filters):**
```sql
BEGIN TRANSACTION;

-- Build WHERE clause dynamically based on filter
-- Base: only terminal instances
CREATE TEMP TABLE purge_candidates AS
SELECT i.instance_id
FROM instances i
JOIN executions e ON i.instance_id = e.instance_id 
    AND i.current_execution_id = e.execution_id
WHERE e.status IN ('Completed', 'Failed')
  -- AND i.instance_id IN (...) if instance_ids provided (qualified: both joined tables have instance_id)
  -- AND e.completed_at < ? if completed_before provided
LIMIT ?;

-- Delete all related data for candidates
DELETE FROM history WHERE instance_id IN (SELECT instance_id FROM purge_candidates);
DELETE FROM executions WHERE instance_id IN (SELECT instance_id FROM purge_candidates);
DELETE FROM orchestrator_queue WHERE instance_id IN (SELECT instance_id FROM purge_candidates);
DELETE FROM worker_queue WHERE instance_id IN (SELECT instance_id FROM purge_candidates);
DELETE FROM instance_locks WHERE instance_id IN (SELECT instance_id FROM purge_candidates);
DELETE FROM instances WHERE instance_id IN (SELECT instance_id FROM purge_candidates);

DROP TABLE purge_candidates;
COMMIT;
```
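
The "build WHERE clause dynamically" step could be sketched in Rust roughly as follows. The `InstanceFilter` field names (`instance_ids`, `completed_before`, `limit`) are inferred from the SQL comments above, and `Param` and `build_candidate_query` are hypothetical names used only for this illustration. Treating an empty ID list as "match nothing" is also an assumption.

```rust
/// Assumed filter shape; every `Some` field adds one ANDed condition.
#[derive(Default)]
struct InstanceFilter {
    instance_ids: Option<Vec<String>>,
    completed_before: Option<u64>, // epoch milliseconds
    limit: Option<u32>,
}

/// Hypothetical typed bind parameter, so integers are bound as integers.
#[derive(Debug, PartialEq)]
enum Param {
    Text(String),
    Int(i64),
}

/// Builds the candidate-selection query and its positional parameters.
fn build_candidate_query(filter: &InstanceFilter) -> (String, Vec<Param>) {
    let mut sql = String::from(
        "SELECT i.instance_id FROM instances i \
         JOIN executions e ON i.instance_id = e.instance_id \
         AND i.current_execution_id = e.execution_id \
         WHERE e.status IN ('Completed', 'Failed')",
    );
    let mut params = Vec::new();

    if let Some(ids) = &filter.instance_ids {
        if ids.is_empty() {
            // Assumption: an empty list matches nothing; avoids invalid `IN ()`.
            sql.push_str(" AND 0");
        } else {
            let placeholders = vec!["?"; ids.len()].join(", ");
            sql.push_str(&format!(" AND i.instance_id IN ({})", placeholders));
            params.extend(ids.iter().cloned().map(Param::Text));
        }
    }
    if let Some(cutoff) = filter.completed_before {
        sql.push_str(" AND e.completed_at < ?");
        params.push(Param::Int(cutoff as i64));
    }
    if let Some(limit) = filter.limit {
        sql.push_str(" LIMIT ?");
        params.push(Param::Int(i64::from(limit)));
    }
    (sql, params)
}
```

Keeping the base condition on terminal statuses unconditional means no combination of filter fields can select a running instance.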

**Prune Executions (ANDed options):**
```sql
BEGIN TRANSACTION;

-- Get current execution (never delete)
SELECT current_execution_id FROM instances WHERE instance_id = ?;

-- Find executions to prune
-- Must satisfy ALL provided conditions
-- Never prune Running executions or current_execution_id
CREATE TEMP TABLE prune_candidates AS
SELECT execution_id FROM executions
WHERE instance_id = ?
  AND execution_id != ?  -- current_execution_id
  AND status != 'Running'  -- never prune active executions
  -- AND execution_id NOT IN (top N by execution_id) if keep_last provided
  -- AND completed_at < ? if completed_before provided
;

DELETE FROM history 
WHERE instance_id = ? 
  AND execution_id IN (SELECT execution_id FROM prune_candidates);

DELETE FROM executions 
WHERE instance_id = ? 
  AND execution_id IN (SELECT execution_id FROM prune_candidates);

DROP TABLE prune_candidates;
COMMIT;
```
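
The candidate-selection rules above can also be modeled in memory, which is convenient for unit-testing the semantics independently of SQL. In this sketch the `Execution` and `PruneOptions` shapes and the `prune_candidates` function are assumptions made for illustration; only the option names `keep_last` and `completed_before` come from the SQL comments.

```rust
#[derive(Debug, PartialEq)]
enum Status {
    Running,
    Completed,
    Failed,
}

struct Execution {
    execution_id: u64,
    status: Status,
    completed_at: Option<u64>, // epoch ms; None while still running
}

#[derive(Default)]
struct PruneOptions {
    keep_last: Option<usize>,
    completed_before: Option<u64>,
}

/// Returns the execution IDs that satisfy ALL provided conditions,
/// never including the current execution or any Running execution.
fn prune_candidates(
    executions: &[Execution],
    current_execution_id: u64,
    opts: &PruneOptions,
) -> Vec<u64> {
    // IDs sorted newest first, so the first `keep_last` entries are retained.
    let mut ids: Vec<u64> = executions.iter().map(|e| e.execution_id).collect();
    ids.sort_unstable_by(|a, b| b.cmp(a));
    let kept: &[u64] = match opts.keep_last {
        Some(n) => &ids[..n.min(ids.len())],
        None => &ids[..0],
    };

    let mut out: Vec<u64> = executions
        .iter()
        .filter(|e| e.execution_id != current_execution_id)
        .filter(|e| e.status != Status::Running)
        .filter(|e| !kept.contains(&e.execution_id))
        .filter(|e| match opts.completed_before {
            Some(cutoff) => e.completed_at.map_or(false, |t| t < cutoff),
            None => true,
        })
        .map(|e| e.execution_id)
        .collect();
    out.sort_unstable();
    out
}
```

Note that an execution without a `completed_at` value never matches a `completed_before` condition in this model.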

---

## 8. Edge Cases & Implications

### 8.1 Zombie Messages

When an instance is deleted, queued messages (e.g., `ExternalRaised`) may still exist.

**Problem:** `fetch_orchestration_item` picks up a message for a deleted instance.

**Resolution:** The orchestration dispatcher already handles missing instances gracefully. When loading instance metadata fails with "not found", the message should be acked without processing and a warning logged. This prevents poison message loops.
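
The intended behavior can be illustrated with a small in-memory model. Everything here (`Message`, `Outcome`, `dispatch`) is a hypothetical simplification and not the dispatcher's real code; it only shows the "ack without processing, log a warning" decision.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Outcome {
    Processed,
    AckedWithoutProcessing,
}

struct Message {
    instance_id: String,
    kind: String, // e.g. "ExternalRaised"
}

/// Decides what to do with a fetched message. A message whose instance no
/// longer exists is acked and dropped so it cannot be redelivered forever.
fn dispatch(
    known_instances: &HashSet<String>,
    msg: &Message,
    warnings: &mut Vec<String>,
) -> Outcome {
    if !known_instances.contains(&msg.instance_id) {
        warnings.push(format!(
            "dropping {} for deleted instance {}",
            msg.kind, msg.instance_id
        ));
        return Outcome::AckedWithoutProcessing;
    }
    Outcome::Processed
}
```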

### 8.2 Force Delete: What It Does (and Doesn't Do)

`force=true` deletes **database state only**. It does NOT:
- Kill in-flight tokio tasks executing orchestration turns
- Abort activity code mid-execution
- Provide any runtime signal to running code

**What happens with `force=true` in various scenarios:**

| Scenario | What Happens | Result |
|----------|--------------|--------|
| Instance waiting (between turns) | DB records deleted, pending queue messages removed | Clean deletion |
| Orchestration turn actively executing | Turn completes, tries to persist → fails (no instance) | Turn result lost, error logged |
| Activity executing (worker has lock) | Worker's next lock renewal fails → detects "cancellation" | Activity can terminate gracefully |
| Activity completes during delete | `ack_work_item` fails (no queue entry) | Activity result lost, error logged |
| Timer pending in queue | Queue message deleted | Timer never fires |
| Waiting on external event | Instance deleted | Future events have no target |

### 8.3 Orphaned Activities

If an instance is deleted while activities are running:

1. Worker's periodic lock renewal fails (queue entry gone)
2. Worker detects this as cancellation signal via `ctx.is_cancelled()`
3. Well-behaved activities check cancellation and terminate gracefully
4. If activity ignores cancellation and completes, `ack_work_item` fails
5. Activity result is lost, worker logs error
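
A well-behaved activity (step 3) checks for cancellation between units of work. The sketch below uses a hypothetical `ActivityCtx` backed by an `AtomicBool` as a stand-in for the real activity context; only the method name `is_cancelled()` is taken from this document.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the activity context.
struct ActivityCtx {
    cancelled: Arc<AtomicBool>,
}

impl ActivityCtx {
    fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::SeqCst)
    }
}

/// Processes batches one at a time, checking for cancellation before each.
/// Returns the number of batches processed, or an error once cancelled.
fn process_batches(
    ctx: &ActivityCtx,
    batches: usize,
    mut work: impl FnMut(usize),
) -> Result<usize, String> {
    for i in 0..batches {
        if ctx.is_cancelled() {
            // Stop early: the result could not be acked anyway.
            return Err(format!("cancelled after {} batches", i));
        }
        work(i);
    }
    Ok(batches)
}
```

The finer the granularity of the check, the less work is wasted after the instance has been deleted.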

**Recommended pattern for graceful cleanup:**
```rust
use std::time::Duration;

// 1. Request cancellation
client.cancel_instance("my-workflow").await?;

// 2. Wait for completion (with timeout)
match tokio::time::timeout(
    Duration::from_secs(30),
    client.wait_for_completion("my-workflow")
).await {
    Ok(_) => {
        // 3a. Gracefully completed, safe to delete
        client.delete_instance("my-workflow", false).await?;
    }
    Err(_) => {
        // 3b. Timeout - force delete as last resort
        client.delete_instance("my-workflow", true).await?;
    }
}
```

### 8.4 Parent-Child Consistency (Sub-Orchestrations)

**Rule:** Sub-orchestrations cannot be deleted directly. Only root orchestrations can be deleted, and deletion cascades to all descendants.

**Problem:** If a child sub-orchestration is deleted while the parent is still running:
- Parent hangs indefinitely waiting for `SubOrchCompleted` event that never arrives
- No timeout → permanent stuck state
- Even if parent is completed, allowing direct child deletion creates inconsistent state

**Resolution: Cascading Delete from Root Only**

1. **Block direct sub-orchestration deletion:** Any attempt to delete an instance that has a parent returns `Err(ClientError::CannotDeleteSubOrchestration)`. This applies to both `force=true` and `force=false`.

2. **Cascade from root:** When deleting a root orchestration:
   - Recursively discover all descendant sub-orchestrations
   - Delete all descendants first (depth-first, children before parents)
   - Then delete the root instance
   - All deletions happen in a single transaction

3. **Force applies to entire tree:** When `force=true` is used on a root:
   - The force flag applies to the root AND all descendants
   - If any instance in the tree is running, all are force-deleted together

**Implementation Requirements:**
- Add `parent_instance_id` column to `instances` table (nullable, NULL for root orchestrations)
- Populate `parent_instance_id` when sub-orchestration is created
- On delete: query for parent, reject if parent exists
- On delete of root: recursively find and delete all children

**Schema Addition:**
```sql
ALTER TABLE instances ADD COLUMN parent_instance_id TEXT REFERENCES instances(instance_id);
CREATE INDEX idx_instances_parent ON instances(parent_instance_id);
```

**Cascade Delete Algorithm:**
```
delete_instance(instance_id, force):
    1. Check if instance has a parent_instance_id
       - If yes: return Err(CannotDeleteSubOrchestration)
    2. Collect all descendants (recursive CTE or iterative query)
    3. If force=false: check status of root and all descendants
       - If any is Running: return Err(InstanceStillRunning)
    4. In single transaction:
       - Delete all descendants (ordered by depth, deepest first)
       - Delete root instance
    5. Return aggregated DeleteResult
```
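
Steps 1 to 3 and the deletion ordering can be sketched over an in-memory parent map. This is an illustrative model only: `Instance`, `DeleteError`, and `plan_cascade_delete` are hypothetical names, and returning an empty plan for an unknown ID is an assumption. A real provider would run the equivalent queries inside a single transaction.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum DeleteError {
    CannotDeleteSubOrchestration,
    InstanceStillRunning,
}

struct Instance {
    parent: Option<String>, // None for root orchestrations
    running: bool,
}

/// Returns instance IDs in deletion order: deepest descendants first, root last.
fn plan_cascade_delete(
    instances: &HashMap<String, Instance>,
    root_id: &str,
    force: bool,
) -> Result<Vec<String>, DeleteError> {
    let root = match instances.get(root_id) {
        Some(inst) => inst,
        None => return Ok(Vec::new()), // assumption: nothing to delete
    };
    if root.parent.is_some() {
        return Err(DeleteError::CannotDeleteSubOrchestration);
    }

    // Breadth-first walk from the root; depth never decreases along `order`.
    let mut order = vec![root_id.to_string()];
    let mut i = 0;
    while i < order.len() {
        let mut children: Vec<String> = instances
            .iter()
            .filter(|(_, inst)| inst.parent.as_deref() == Some(order[i].as_str()))
            .map(|(id, _)| id.clone())
            .collect();
        children.sort(); // deterministic order for the example
        order.extend(children);
        i += 1;
    }

    if !force && order.iter().any(|id| instances[id].running) {
        return Err(DeleteError::InstanceStillRunning);
    }

    // Reversing the breadth-first order puts children before their parents.
    order.reverse();
    Ok(order)
}
```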

### 8.5 Critical: Instance Lock Deletion Prevents Zombie Recreation

**This is a critical implementation requirement for all providers.**

When force-deleting an instance, the `instance_locks` table entry MUST be deleted. This prevents a race condition where an in-flight orchestration turn could recreate a deleted instance.

**The Race (if locks are NOT deleted):**
```
T0: Dispatcher fetches item, acquires lock in instance_locks
T1: Orchestration turn executes in memory...
T2: Force delete runs but does NOT delete from instance_locks
    - Deletes from: instances, executions, history, queues
T3: Turn completes, calls ack_orchestration_item()
T4: Lock lookup SUCCEEDS (lock still exists!)
T5: INSERT OR IGNORE INTO instances → RECREATES deleted instance!
T6: History written to "zombie" instance

❌ Result: Instance exists after being "deleted"
```

**The Fix (locks ARE deleted):**
```
T0: Dispatcher fetches item, acquires lock in instance_locks
T1: Orchestration turn executes in memory...
T2: Force delete runs:
    - DELETE FROM instance_locks WHERE instance_id = ?  ← Critical!
    - Deletes from: instances, executions, history, queues
T3: Turn completes, calls ack_orchestration_item()
T4: Lock lookup FAILS → "Invalid lock token" error
T5: Turn result discarded, no recreation

✅ Result: Instance stays deleted
```
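
Both timelines can be reproduced with a toy in-memory model. The `Store` type and its methods are hypothetical simplifications of the provider: `ack` validates the lock token and then re-inserts the instance row, which stands in for the `INSERT OR IGNORE` at T5.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct Store {
    instances: HashSet<String>,
    instance_locks: HashMap<String, String>, // instance_id -> lock token
}

impl Store {
    /// T0: the dispatcher fetches an item and acquires the instance lock.
    fn fetch(&mut self, instance_id: &str, token: &str) {
        self.instances.insert(instance_id.to_string());
        self.instance_locks
            .insert(instance_id.to_string(), token.to_string());
    }

    /// T2: force delete. `delete_lock = false` models the buggy variant.
    fn force_delete(&mut self, instance_id: &str, delete_lock: bool) {
        self.instances.remove(instance_id);
        if delete_lock {
            self.instance_locks.remove(instance_id);
        }
    }

    /// T3 onward: the turn completes and tries to persist its result.
    fn ack(&mut self, instance_id: &str, token: &str) -> Result<(), String> {
        let valid = self
            .instance_locks
            .get(instance_id)
            .map_or(false, |t| t == token);
        if !valid {
            return Err("Invalid lock token".to_string());
        }
        // Models INSERT OR IGNORE: recreates the row if it was deleted.
        self.instances.insert(instance_id.to_string());
        self.instance_locks.remove(instance_id);
        Ok(())
    }
}
```

With `delete_lock = false` the ack succeeds and the instance reappears; with `delete_lock = true` the ack fails and the instance stays deleted.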

**Provider Validation Test:** `test_force_delete_prevents_ack_recreation` verifies this behavior:
1. Create and start an orchestration
2. Fetch orchestration item (acquires lock)
3. Force delete the instance
4. Attempt to ack the item
5. Assert: ack returns error (not success)
6. Assert: instance does NOT exist in database

### 8.6 Identity Reuse

After deletion, a new instance with the same ID can be created immediately. This is intentional (useful for testing/resetting).

### 8.7 Active Execution Protection

The following are **NEVER** deleted/pruned:
- The `current_execution_id` of any instance
- Any execution with `status = 'Running'`

This ensures that:
- In-progress orchestrations are not corrupted
- `ContinueAsNew` chains don't lose their active head

---

## 9. Implementation Plan

### Phase 1: Schema Updates
1. **Fix `completed_at` storage** — Use Rust epoch milliseconds instead of `CURRENT_TIMESTAMP`
2. **Add `parent_instance_id` column** — Track parent-child relationships for cascading delete
3. Add migration if needed for existing data
4. **Populate `parent_instance_id`** — Update sub-orchestration creation to set parent reference

### Phase 2: Core Types & Provider
1. Add `InstanceFilter`, `PruneOptions` types to `src/providers/management.rs`
2. Add `DeleteResult`, `PurgeResult`, `PruneResult` types
3. Extend `ProviderAdmin` trait with new methods
4. Implement in `SqliteProvider`
5. Add provider validation tests

### Phase 3: Client Integration
1. Expose methods on `Client` struct
2. Add error types:
   - `ClientError::InstanceStillRunning` — Instance is running and force=false
   - `ClientError::CannotDeleteSubOrchestration` — Cannot delete sub-orchestration directly; delete root instead
   - `ClientError::InstanceNotFound` — Instance doesn't exist
3. Add integration tests
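
As a rough illustration, the three error cases could be modeled as below. The variant names come from the list above; whether they are unit variants or carry data (such as the instance ID), and the exact message wording, are assumptions.

```rust
use std::fmt;

/// Sketch of the new error cases; the real `ClientError` may carry more data.
#[derive(Debug, PartialEq)]
enum ClientError {
    InstanceStillRunning,
    CannotDeleteSubOrchestration,
    InstanceNotFound,
}

impl fmt::Display for ClientError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let msg = match self {
            ClientError::InstanceStillRunning => "instance is running and force=false",
            ClientError::CannotDeleteSubOrchestration => {
                "cannot delete sub-orchestration directly; delete root instead"
            }
            ClientError::InstanceNotFound => "instance not found",
        };
        f.write_str(msg)
    }
}

impl std::error::Error for ClientError {}
```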

---

## 10. Test Plan

### 10.1 Provider Validation Tests (`src/provider_validations.rs`)

These tests validate the `ProviderAdmin` trait implementation.

#### Delete Instance Tests
| Test | Description |
|------|-------------|
| `test_delete_terminal_instances` | Delete completed, failed, and cancelled instances, verify all tables cleaned |
| `test_delete_running_rejected_force_succeeds` | Attempt to delete running instance without force (expect error), then with force (succeeds) |
| `test_delete_nonexistent_instance` | Delete non-existent instance returns `instances_deleted: 0` |
| `test_delete_cleans_queues_and_locks` | Verify orchestrator_queue, worker_queue, and instance_locks entries are deleted |
| `test_force_delete_prevents_ack_recreation` | **CRITICAL**: Fetch orchestration item (acquires lock), force delete instance, then try to ack - ack must fail, instance must NOT be recreated |
| `test_cascade_delete_hierarchy` | Delete root with children, verify all descendants deleted |

#### Bulk Delete Instance Tests
| Test | Description |
|------|-------------|
| `test_delete_instance_bulk_filter_combinations` | Delete by instance_ids, non-existent IDs, and empty filter |
| `test_delete_instance_bulk_safety_and_limits` | Skips running instances, respects limit parameter |
| `test_delete_instance_bulk_completed_before_filter` | Delete instances completed before/after cutoff |
| `test_delete_instance_bulk_cascades_to_children` | Bulk delete cascades to sub-orchestrations |

#### Prune Execution Tests
| Test | Description |
|------|-------------|
| `test_prune_options_combinations` | Keep last N, completed_before, and combined filters |
| `test_prune_safety` | Current execution and running executions are never pruned |

#### Bulk Prune Tests
| Test | Description |
|------|-------------|
| `test_prune_bulk` | Bulk prune with instance filter and prune options |

### 10.2 Force Delete Behavior Tests

These tests verify correct behavior when force-deleting instances with in-flight work.

#### Activity Interaction Tests
| Test | Description |
|------|-------------|
| `test_force_delete_activity_lock_renewal_fails` | Force delete instance while activity running; verify worker's next lock renewal fails |
| `test_force_delete_activity_detects_cancellation` | After lock renewal fails, verify `ctx.is_cancelled()` returns true |
| `test_force_delete_activity_ack_fails_gracefully` | Activity completes after delete; verify ack fails with appropriate error (not panic) |
| `test_force_delete_clears_worker_queue` | Verify all worker_queue entries for instance are deleted |
| `test_force_delete_clears_orchestrator_queue` | Verify all orchestrator_queue entries for instance are deleted |
| `test_force_delete_prevents_activity_completion_delivery` | **CRITICAL**: Fetch activity (acquires lock), force delete instance, ack activity - verify ack fails and no ActivityCompleted message is enqueued to orchestrator_queue |

#### Orchestration Turn Interaction Tests
| Test | Description |
|------|-------------|
| `test_force_delete_during_orchestration_turn` | Delete while turn is executing; turn completion fails to persist |
| `test_force_delete_between_turns` | Delete while instance is waiting (idle); clean deletion |
| `test_force_delete_pending_timer` | Delete instance with pending timer message; timer never fires |
| `test_force_delete_pending_external_event` | Delete instance waiting on event; event delivery fails gracefully |

#### Concurrent Operation Tests
| Test | Description |
|------|-------------|
| `test_concurrent_force_delete_and_activity_ack` | Race between delete and activity ack; one succeeds, other fails gracefully |
| `test_concurrent_force_delete_and_turn_completion` | Race between delete and turn persist; one succeeds, other fails gracefully |
| `test_concurrent_force_delete_and_lock_renewal` | Race between delete and lock renewal; renewal fails after delete |

#### Purge Behavior Tests
| Test | Description |
|------|-------------|
| `test_purge_skips_running_instances` | Bulk purge with Running instances in filter; they are silently skipped |
| `test_purge_only_deletes_terminal` | Purge only affects Completed/Failed instances |

### 10.3 Parent-Child Hierarchy Tests (Cascading Delete)

| Test | Description |
|------|-------------|
| `test_cascade_delete_hierarchy` | Sub-orchestrations cannot be deleted directly; cascade delete from root |
| `test_delete_get_instance_tree` | Get full tree with multiple levels of children |
| `test_delete_get_parent_id` | Get parent for sub-orchestrations, None for roots |
| `test_list_children` | List direct children of an instance |
| `test_delete_instances_atomic` | Atomic batch delete with orphan validation |
| `test_delete_instances_atomic_force` | Atomic batch delete with force flag |
| `test_delete_instances_atomic_orphan_detection` | Reject deletion if it would create orphans |

### 10.4 Time-Based Retention Tests

| Test | Description |
|------|-------------|
| `test_purge_5_day_retention` | Purge instances completed more than 5 days ago |
| `test_prune_30_day_execution_retention` | Prune executions older than 30 days |
| `test_timestamp_comparison_accuracy` | Verify millisecond precision in time comparisons |
| `test_completed_at_stored_as_integer` | Verify completed_at is stored as epoch ms, not TEXT |

### 10.5 Edge Case Tests

| Test | Description |
|------|-------------|
| `test_zombie_orchestrator_message_after_delete` | Delete instance, then dispatcher picks up orphaned message; verify graceful ack without processing |
| `test_zombie_worker_message_after_delete` | Delete instance, then worker picks up orphaned activity; verify graceful handling |
| `test_identity_reuse_after_delete` | Delete instance, immediately create new instance with same ID; verify clean slate |
| `test_prune_continue_as_new_chain` | Prune old executions from long ContinueAsNew chain |
| `test_delete_instance_multiple_pending_activities` | Delete instance with 3+ pending activities; all queue entries cleaned |
| `test_delete_sub_orchestration_always_rejected` | Direct deletion of sub-orchestration blocked regardless of force flag |
| `test_external_event_arrives_after_delete` | Send event to deleted instance; verify graceful "not found" handling |

---

## 11. Alternatives Considered

### Automatic TTL
Automatically deleting old records via background sweeper.
- *Pros*: Zero maintenance after configuration
- *Cons*: Complex to implement generically; less control; harder to debug

**Decision:** Provide explicit APIs first. Automatic TTL can be built on top of `purge_instances` later (e.g., a scheduled task calling the API).

### Soft Deletes
Marking records as deleted rather than removing them.
- *Pros*: Recoverable; audit trail
- *Cons*: Doesn't solve storage growth; all queries need to filter deleted items

**Decision:** Hard deletes only. Users needing audit trails should export data before deletion.

### OR vs AND Filter Semantics
Using OR between filter criteria (match any).
- *Cons*: Less intuitive; harder to express "only delete old items from this list"

**Decision:** AND semantics. Users wanting OR can make multiple calls.
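
A small predicate makes the chosen semantics concrete. The types below are hypothetical simplifications (field names follow the filter criteria used elsewhere in this document): an unset criterion matches everything, and every set criterion must hold.

```rust
/// Assumed filter shape for the example.
struct InstanceFilter {
    instance_ids: Option<Vec<String>>,
    completed_before: Option<u64>, // epoch ms
}

struct Candidate<'a> {
    instance_id: &'a str,
    completed_at: u64,
}

/// AND semantics: a candidate matches only if it passes every provided criterion.
fn matches(filter: &InstanceFilter, c: &Candidate<'_>) -> bool {
    let id_ok = filter
        .instance_ids
        .as_ref()
        .map_or(true, |ids| ids.iter().any(|id| id == c.instance_id));
    let time_ok = filter
        .completed_before
        .map_or(true, |cutoff| c.completed_at < cutoff);
    id_ok && time_ok
}
```

With both criteria set, this expresses "only delete old items from this list". The OR case is covered by two calls, one with only `instance_ids` set and one with only `completed_before` set.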

---

## 12. Summary

| Operation | User API | Provider API | Scope | Return Type |
|-----------|----------|--------------|-------|-------------|
| Delete one instance | `client.delete_instance(id, force)` | `delete_instance(id, force)` | Single instance | `DeleteInstanceResult` |
| Bulk delete instances | `client.delete_instance_bulk(filter)` | `delete_instance_bulk(filter)` | Multi-instance | `DeleteInstanceResult` |
| Prune one instance | `client.prune_executions(id, options)` | `prune_executions(id, &options)` | Single instance | `PruneResult` |
| Bulk prune executions | `client.prune_executions_bulk(filter, options)` | `prune_executions_bulk(&filter, &options)` | Multi-instance | `PruneResult` |

**Key Design Decisions:**
- `InstanceFilter` is reusable across management APIs
- All filter criteria use AND semantics
- In bulk operations, running instances are always protected (skipped, not errored); single-instance deletion rejects them unless `force=true`
- Current execution is always protected during pruning
- Active (Running) executions are never pruned
- Sub-orchestrations can only be deleted via their root (cascade delete)
- Timestamps stored as Rust epoch milliseconds (not SQLite CURRENT_TIMESTAMP)
- Unified `DeleteInstanceResult` type for both single and bulk deletion (uses `instances_deleted: u64` instead of bool)