duroxide 0.1.27

# Management API Improvements Proposal

> **Status**: Proposed  
> **Issue**: [#10](https://github.com/microsoft/duroxide/issues/10)  
> **Related**: [Core Improvements Roadmap](proposals/core-improvements-roadmap.md#3-management-api-improvements)

## Overview

This proposal details improvements to the duroxide Management API to address gaps in production workloads: the N+1 query problem, missing data cleanup APIs, pagination/filtering limitations, and query performance at scale.

---

## Table of Contents

1. [Problem Statement](#problem-statement)
2. [Proposed Solution](#proposed-solution)
3. [Enriched List Responses](#1-enriched-list-responses)
4. [Truncate APIs](#2-truncate-apis)
5. [Pagination & Filtering](#3-pagination--filtering)
6. [Performance Optimizations](#4-performance-optimizations)
7. [Schema Changes](#5-schema-changes)
8. [Error Handling](#6-error-handling)
9. [Implementation Phases](#7-implementation-phases)
10. [Open Questions](#8-open-questions)

---

## Problem Statement

The current management API has significant gaps that limit its usefulness for production workloads:

### N+1 Query Problem

`list_instances()` returns only instance IDs, forcing clients to make N additional calls to `get_instance_info()` for basic dashboard data:

```rust
// Current: N+1 queries for N instances
let ids = client.list_all_instances().await?;  // 1 query
for id in &ids {
    let info = client.get_instance_info(id).await?;  // N queries
    println!("{}: {}", info.orchestration_name, info.status);
}
```

For 1,000 instances, this means 1,001 database round-trips, so dashboard load time grows linearly with the number of instances.

### No Data Cleanup APIs

There's no way to:
- Delete completed/failed instances after testing
- Truncate history for eternal orchestrations (continuous/polling workflows)
- Clean up orphaned data from crashed workers
- Bulk delete instances matching criteria

Eternal orchestrations using `continue_as_new` accumulate unbounded history:

```
Instance "daily-report"
├── Execution 1: 500 events (day 1)
├── Execution 2: 500 events (day 2)
├── ... 
└── Execution 365: 500 events (day 365)
    → 182,500 events, growing forever
```

### Performance at Scale

The current implementation has no pagination or efficient filtering:
- `list_instances()` loads ALL instances into memory
- `list_instances_by_status()` scans entire table
- No indexes optimized for management queries
- Dashboard queries compete with hot-path operations

### Missing Query Capabilities

Operators cannot answer basic questions:
- "Show me all failed `OrderWorkflow` instances from last hour"
- "How many `PaymentProcessor` instances are currently running?"
- "Find instances stuck in running state for > 24 hours"

---

## Proposed Solution

This proposal introduces four categories of improvements:
1. **Enriched List Responses** — Return full metadata in single query
2. **Truncate APIs** — Delete instances and prune execution history
3. **Pagination & Filtering** — Scalable queries with cursor-based pagination
4. **Performance Optimizations** — Indexes, batching, and query patterns

---

## 1. Enriched List Responses

### 1.1 Enhanced InstanceInfo Struct

Extend the existing `InstanceInfo` to include additional useful fields:

```rust
/// Comprehensive instance metadata for management operations.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct InstanceInfo {
    // Existing fields
    pub instance_id: String,
    pub orchestration_name: String,
    pub orchestration_version: String,
    pub current_execution_id: u64,
    pub status: String,
    pub output: Option<String>,
    pub created_at: u64,
    pub updated_at: u64,
    
    // NEW: Additional metadata
    /// Total number of executions (for continue_as_new tracking)
    pub execution_count: u64,
    /// Total events across all executions (storage indicator)
    pub total_event_count: u64,
    /// Input provided when instance was started
    pub input: Option<String>,
    /// Tags for categorization (if tag support is added)
    pub tags: Option<Vec<String>>,
}
```

### 1.2 Batch List API

New provider trait method to return enriched data in a single query:

```rust
#[async_trait::async_trait]
pub trait ProviderAdmin: Any + Send + Sync {
    // ... existing methods ...

    /// List all instances with full metadata in a single query.
    /// 
    /// Unlike `list_instances()` which returns only IDs, this method
    /// returns complete `InstanceInfo` for each instance, eliminating
    /// the N+1 query problem.
    ///
    /// # Performance
    /// 
    /// Single SQL query with JOINs and aggregations. For large datasets,
    /// prefer `list_instances_paginated()` to avoid memory issues.
    ///
    /// # Implementation Example
    ///
    /// ```sql
    /// SELECT 
    ///     i.instance_id,
    ///     i.orchestration_name,
    ///     i.orchestration_version,
    ///     i.current_execution_id,
    ///     i.created_at,
    ///     i.updated_at,
    ///     e.status,
    ///     e.output,
    ///     (SELECT COUNT(*) FROM executions WHERE instance_id = i.instance_id) as execution_count,
    ///     (SELECT COUNT(*) FROM history WHERE instance_id = i.instance_id) as total_event_count
    /// FROM instances i
    /// LEFT JOIN executions e ON i.instance_id = e.instance_id 
    ///     AND i.current_execution_id = e.execution_id
    /// ORDER BY i.created_at DESC
    /// ```
    async fn list_instances_with_info(&self) -> Result<Vec<InstanceInfo>, ProviderError>;
}
```
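To make the shape of the enriched query concrete, here is a minimal in-memory sketch of the same aggregation: one pass over executions, one pass over history, then a single join against the instance list. The row types (`InstanceRow`, `ExecutionRow`, `HistoryRow`), the `Summary` struct, and the `summarize` function are hypothetical stand-ins for illustration, not part of the duroxide API.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for rows of the three tables.
struct InstanceRow { instance_id: String, current_execution_id: u64 }
struct ExecutionRow { instance_id: String, execution_id: u64, status: String }
struct HistoryRow { instance_id: String }

/// Subset of the enriched `InstanceInfo` fields, enough to show the join.
#[derive(Debug, PartialEq)]
struct Summary {
    instance_id: String,
    status: Option<String>, // None mirrors the LEFT JOIN finding no execution row
    execution_count: u64,
    total_event_count: u64,
}

fn summarize(
    instances: &[InstanceRow],
    executions: &[ExecutionRow],
    history: &[HistoryRow],
) -> Vec<Summary> {
    // One pass over executions: count per instance and index status by (instance, execution).
    let mut execution_counts: HashMap<&str, u64> = HashMap::new();
    let mut status_by_key: HashMap<(&str, u64), &str> = HashMap::new();
    for e in executions {
        *execution_counts.entry(e.instance_id.as_str()).or_default() += 1;
        status_by_key.insert((e.instance_id.as_str(), e.execution_id), e.status.as_str());
    }
    // One pass over history: count events per instance.
    let mut event_counts: HashMap<&str, u64> = HashMap::new();
    for h in history {
        *event_counts.entry(h.instance_id.as_str()).or_default() += 1;
    }
    // Join: each instance picks up the status of its current execution only.
    instances
        .iter()
        .map(|i| {
            let id = i.instance_id.as_str();
            Summary {
                instance_id: i.instance_id.clone(),
                status: status_by_key
                    .get(&(id, i.current_execution_id))
                    .map(|s| s.to_string()),
                execution_count: execution_counts.get(id).copied().unwrap_or(0),
                total_event_count: event_counts.get(id).copied().unwrap_or(0),
            }
        })
        .collect()
}
```

The point of the sketch is that the cost is proportional to the table sizes, not to the number of client calls; a real provider would push this work into the database as in the SQL above.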

### 1.3 Client API Addition

```rust
impl Client {
    /// List all instances with full metadata.
    ///
    /// Returns complete instance information in a single database query,
    /// avoiding the N+1 problem of calling `list_all_instances()` followed
    /// by `get_instance_info()` for each.
    ///
    /// # Example
    ///
    /// ```ignore
    /// let instances = client.list_instances_with_info().await?;
    /// for info in instances {
    ///     println!("{} ({}): {} - {} events", 
    ///         info.instance_id,
    ///         info.orchestration_name,
    ///         info.status,
    ///         info.total_event_count
    ///     );
    /// }
    /// ```
    pub async fn list_instances_with_info(&self) -> Result<Vec<InstanceInfo>, ClientError>;
}
```

---

## 2. Truncate APIs

### 2.1 Instance Deletion

Complete deletion of an instance and all associated data:

```rust
#[async_trait::async_trait]
pub trait ProviderAdmin: Any + Send + Sync {
    // ... existing methods ...

    /// Delete an instance and all its associated data.
    ///
    /// Removes:
    /// - Instance record from `instances` table
    /// - All executions from `executions` table
    /// - All history events from `history` table
    /// - All pending queue messages for this instance
    /// - Any instance locks
    ///
    /// # Safety
    ///
    /// This operation is IRREVERSIBLE. The instance and all its history
    /// will be permanently deleted.
    ///
    /// # Constraints
    ///
    /// - Instance must NOT be in "Running" status (prevents deleting active work)
    /// - Use `force = true` to override (for cleanup of stuck instances)
    ///
    /// # Returns
    ///
    /// `TruncateResult` with counts of deleted records.
    ///
    /// # Errors
    ///
    /// - `InstanceNotFound` if instance doesn't exist
    /// - `InstanceStillRunning` if instance is active and force=false
    async fn delete_instance(
        &self,
        instance_id: &str,
        force: bool,
    ) -> Result<TruncateResult, ProviderError>;
    
    /// Bulk delete instances matching criteria.
    ///
    /// Deletes multiple instances in batched transactions (see section 4.3)
    /// so that no single transaction holds locks for long.
    /// Useful for cleanup operations.
    ///
    /// # Parameters
    ///
    /// * `filter` - Criteria for selecting instances to delete
    ///
    /// # Returns
    ///
    /// Total counts of deleted records across all instances.
    async fn delete_instances_bulk(
        &self,
        filter: InstanceFilter,
    ) -> Result<TruncateResult, ProviderError>;
}

/// Result of truncation/deletion operations.
#[derive(Debug, Clone, Default)]
pub struct TruncateResult {
    /// Number of instance records deleted
    pub instances_deleted: u64,
    /// Number of execution records deleted
    pub executions_deleted: u64,
    /// Number of history events deleted
    pub events_deleted: u64,
    /// Number of queue messages deleted
    pub messages_deleted: u64,
}
```
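The `force` constraint can be illustrated with a small in-memory sketch. Everything here (`InMemoryStore`, `StoredInstance`, `DeleteError`) is a hypothetical stand-in, and the local `TruncateResult` is a copy of the struct above so the example compiles on its own; a real provider would perform the same checks inside a database transaction.

```rust
use std::collections::HashMap;

#[derive(Debug, Default, PartialEq)]
struct TruncateResult {
    instances_deleted: u64,
    executions_deleted: u64,
    events_deleted: u64,
    messages_deleted: u64,
}

// Hypothetical error type covering the two documented failure cases.
#[derive(Debug, PartialEq)]
enum DeleteError {
    InstanceNotFound,
    InstanceStillRunning,
}

// Hypothetical per-instance bookkeeping: counts stand in for real rows.
struct StoredInstance {
    status: String,
    executions: u64,
    events: u64,
    queued_messages: u64,
}

#[derive(Default)]
struct InMemoryStore {
    instances: HashMap<String, StoredInstance>,
}

impl InMemoryStore {
    fn delete_instance(
        &mut self,
        instance_id: &str,
        force: bool,
    ) -> Result<TruncateResult, DeleteError> {
        let instance = self
            .instances
            .get(instance_id)
            .ok_or(DeleteError::InstanceNotFound)?;
        // Refuse to delete active work unless the caller explicitly forces it.
        if instance.status == "Running" && !force {
            return Err(DeleteError::InstanceStillRunning);
        }
        let removed = self
            .instances
            .remove(instance_id)
            .ok_or(DeleteError::InstanceNotFound)?;
        Ok(TruncateResult {
            instances_deleted: 1,
            executions_deleted: removed.executions,
            events_deleted: removed.events,
            messages_deleted: removed.queued_messages,
        })
    }
}
```

Checking the status and removing the data must happen atomically in a real backend; otherwise an instance could start running between the check and the delete.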

### 2.2 Execution History Pruning

For eternal orchestrations, prune old executions while keeping recent ones:

```rust
#[async_trait::async_trait]
pub trait ProviderAdmin: Any + Send + Sync {
    // ... existing methods ...

    /// Prune old executions for an instance, keeping only recent ones.
    ///
    /// This is critical for "eternal" orchestrations that use `continue_as_new`
    /// repeatedly. Without pruning, history grows unbounded.
    ///
    /// # Parameters
    ///
    /// * `instance_id` - Instance to prune
    /// * `keep_last_n` - Number of most recent executions to retain
    ///
    /// # Behavior
    ///
    /// - Deletes executions where `execution_id <= current - max(keep_last_n, 1)`
    /// - Deletes associated history events
    /// - NEVER deletes the current execution (even if keep_last_n = 0)
    ///
    /// # Example
    ///
    /// Instance has executions [1, 2, 3, 4, 5], current is 5:
    /// - `prune_executions("inst", 2)` → deletes 1, 2, 3; keeps 4, 5
    /// - `prune_executions("inst", 0)` → deletes 1, 2, 3, 4; keeps 5
    ///
    /// # Returns
    ///
    /// Number of executions deleted.
    async fn prune_executions(
        &self,
        instance_id: &str,
        keep_last_n: u64,
    ) -> Result<u64, ProviderError>;

    /// Prune executions older than a specified age.
    ///
    /// # Parameters
    ///
    /// * `instance_id` - Instance to prune
    /// * `older_than` - Delete executions that completed more than this long ago
    ///
    /// # Returns
    ///
    /// Number of executions deleted.
    async fn prune_executions_by_age(
        &self,
        instance_id: &str,
        older_than: Duration,
    ) -> Result<u64, ProviderError>;
}
```
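The `keep_last_n` rule is easy to get off by one, so here is a minimal sketch of the selection logic that reproduces the worked example in the doc comment above. The helper name `executions_to_prune` is hypothetical, and the sketch assumes execution IDs are contiguous and increasing.

```rust
/// Hypothetical helper: returns the execution IDs that a `keep_last_n`
/// prune would delete, oldest first. Assumes contiguous, increasing IDs.
fn executions_to_prune(execution_ids: &[u64], current: u64, keep_last_n: u64) -> Vec<u64> {
    // The current execution is always retained, even when keep_last_n == 0.
    let retained = keep_last_n.max(1);
    // Highest execution ID that is eligible for deletion; if the retention
    // window covers every execution, there is nothing to prune.
    let cutoff = match current.checked_sub(retained) {
        Some(cutoff) => cutoff,
        None => return Vec::new(),
    };
    let mut to_delete: Vec<u64> = execution_ids
        .iter()
        .copied()
        .filter(|&id| id <= cutoff)
        .collect();
    to_delete.sort_unstable();
    to_delete
}
```

With executions `[1, 2, 3, 4, 5]` and current execution 5, `keep_last_n = 2` selects 1, 2, 3 and `keep_last_n = 0` selects 1, 2, 3, 4, matching the example.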

### 2.3 Automatic Retention Policy

Allow configuring automatic pruning for eternal orchestrations:

```rust
/// Retention policy for automatic execution pruning.
#[derive(Debug, Clone)]
pub struct ExecutionRetentionPolicy {
    /// Keep at least this many recent executions.
    /// Set to 0 to keep only the current execution.
    pub keep_last_n: u64,
    
    /// Delete executions older than this duration.
    /// Combined with `keep_last_n`: an execution is deleted only if it falls
    /// outside the `keep_last_n` window AND is older than this duration.
    pub keep_duration: Option<Duration>,
    
    /// Run pruning automatically after each continue_as_new.
    /// If false, pruning must be triggered manually or via scheduled job.
    pub auto_prune: bool,
}

impl Default for ExecutionRetentionPolicy {
    fn default() -> Self {
        Self {
            keep_last_n: 10,
            keep_duration: Some(Duration::from_secs(7 * 24 * 60 * 60)), // 7 days
            auto_prune: false,
        }
    }
}

/// Configure retention when starting an orchestration.
impl Client {
    pub async fn start_orchestration_with_options(
        &self,
        name: &str,
        version: &str,
        instance_id: &str,
        input: Option<&str>,
        options: StartOptions,
    ) -> Result<(), ClientError>;
}

#[derive(Debug, Clone, Default)]
pub struct StartOptions {
    /// Execution retention policy for this instance.
    /// Only meaningful for orchestrations that use continue_as_new.
    pub retention_policy: Option<ExecutionRetentionPolicy>,
    
    /// Tags for categorization and filtering.
    pub tags: Option<Vec<String>>,
}
```
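One way to read the policy is as a per-execution predicate in which both conditions must hold before an execution is deleted. The sketch below assumes that interpretation, and additionally assumes that `keep_duration: None` means only the count applies. The `should_prune` function and its `rank` parameter are hypothetical, and the struct is a trimmed copy of the one above.

```rust
use std::time::Duration;

/// Trimmed copy of the proposed policy (auto_prune is irrelevant to the predicate).
struct ExecutionRetentionPolicy {
    keep_last_n: u64,
    keep_duration: Option<Duration>,
}

/// Hypothetical predicate. `rank` counts back from the newest execution:
/// 0 is the current execution, 1 the one before it, and so on.
fn should_prune(policy: &ExecutionRetentionPolicy, rank: u64, age: Duration) -> bool {
    // The current execution is never pruned; beyond that, the newest
    // `keep_last_n` executions are protected.
    let outside_keep_window = rank >= policy.keep_last_n.max(1);
    // Assumption: with no duration configured, only the count applies.
    let old_enough = match policy.keep_duration {
        Some(limit) => age > limit,
        None => true,
    };
    outside_keep_window && old_enough
}
```

Under the default policy (keep 10, 7 days), an execution at rank 12 that is one day old survives, because it is not yet old enough even though it is outside the count window.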

### 2.4 Client API for Truncation

```rust
impl Client {
    /// Delete a completed or failed instance.
    ///
    /// # Example
    ///
    /// ```ignore
    /// // Clean up after testing
    /// client.delete_instance("test-instance-123", false).await?;
    ///
    /// // Force delete a stuck instance
    /// client.delete_instance("stuck-instance", true).await?;
    /// ```
    pub async fn delete_instance(
        &self,
        instance_id: &str,
        force: bool,
    ) -> Result<TruncateResult, ClientError>;
    
    /// Prune old executions for an eternal orchestration.
    ///
    /// # Example
    ///
    /// ```ignore
    /// // Keep only last 5 executions
    /// let deleted = client.prune_executions("daily-job", 5).await?;
    /// println!("Pruned {} old executions", deleted);
    /// ```
    pub async fn prune_executions(
        &self,
        instance_id: &str,
        keep_last_n: u64,
    ) -> Result<u64, ClientError>;
    
    /// Bulk delete instances matching criteria.
    ///
    /// # Example
    ///
    /// ```ignore
    /// // Delete all completed instances older than 30 days
    /// let filter = InstanceFilter::new()
    ///     .with_status("Completed")
    ///     .older_than(Duration::from_secs(30 * 24 * 60 * 60));
    /// let result = client.delete_instances_bulk(filter).await?;
    /// println!("Deleted {} instances", result.instances_deleted);
    /// ```
    pub async fn delete_instances_bulk(
        &self,
        filter: InstanceFilter,
    ) -> Result<TruncateResult, ClientError>;
}
```

---

## 3. Pagination & Filtering

### 3.1 Filter Criteria

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Filter criteria for listing instances.
#[derive(Debug, Clone, Default)]
pub struct InstanceFilter {
    /// Filter by orchestration status.
    /// Multiple values = OR (matches any).
    pub status: Option<Vec<String>>,
    
    /// Filter by orchestration name (exact match).
    pub orchestration_name: Option<String>,
    
    /// Filter by orchestration name prefix.
    pub orchestration_name_prefix: Option<String>,
    
    /// Filter by instance ID prefix.
    pub instance_id_prefix: Option<String>,
    
    /// Only include instances created after this timestamp (millis).
    pub created_after: Option<u64>,
    
    /// Only include instances created before this timestamp (millis).
    pub created_before: Option<u64>,
    
    /// Only include instances updated after this timestamp (millis).
    pub updated_after: Option<u64>,
    
    /// Only include instances updated before this timestamp (millis).
    pub updated_before: Option<u64>,
    
    /// Filter by tags (if tag support is implemented).
    /// Multiple values = AND (must have all tags).
    pub tags: Option<Vec<String>>,
}

impl InstanceFilter {
    pub fn new() -> Self {
        Self::default()
    }
    
    pub fn with_status(mut self, status: &str) -> Self {
        self.status = Some(vec![status.to_string()]);
        self
    }
    
    pub fn with_statuses(mut self, statuses: Vec<&str>) -> Self {
        self.status = Some(statuses.into_iter().map(String::from).collect());
        self
    }
    
    pub fn with_orchestration_name(mut self, name: &str) -> Self {
        self.orchestration_name = Some(name.to_string());
        self
    }
    
    pub fn with_name_prefix(mut self, prefix: &str) -> Self {
        self.orchestration_name_prefix = Some(prefix.to_string());
        self
    }
    
    pub fn created_after(mut self, timestamp: u64) -> Self {
        self.created_after = Some(timestamp);
        self
    }
    
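    /// Match instances created more than `duration` ago (sets `created_before`).
    pub fn older_than(mut self, duration: Duration) -> Self {
        // A system clock set before the Unix epoch yields 0 instead of a panic.
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap_or_default()
            .as_millis() as u64;
        // Saturate rather than truncate if the duration exceeds u64 milliseconds.
        let age = u64::try_from(duration.as_millis()).unwrap_or(u64::MAX);
        self.created_before = Some(now.saturating_sub(age));
        self
    }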
    
    /// Check if any filters are set.
    pub fn is_empty(&self) -> bool {
        self.status.is_none()
            && self.orchestration_name.is_none()
            && self.orchestration_name_prefix.is_none()
            && self.instance_id_prefix.is_none()
            && self.created_after.is_none()
            && self.created_before.is_none()
            && self.updated_after.is_none()
            && self.updated_before.is_none()
            && self.tags.is_none()
    }
}
```
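The OR/AND semantics documented on the fields can be summarized as a single predicate. The sketch below is one possible reading, not the provider implementation: `Candidate` and `matches` are hypothetical names, the filter is a trimmed copy of the struct above, and the time bounds are assumed to be exclusive.

```rust
/// Trimmed copy of the proposed filter, limited to the fields exercised here.
#[derive(Debug, Default)]
struct InstanceFilter {
    status: Option<Vec<String>>,
    orchestration_name_prefix: Option<String>,
    created_after: Option<u64>,
    created_before: Option<u64>,
    tags: Option<Vec<String>>,
}

/// Hypothetical view of one instance, as a provider might see it.
struct Candidate<'a> {
    status: &'a str,
    orchestration_name: &'a str,
    created_at: u64,
    tags: &'a [&'a str],
}

/// Returns true when the candidate satisfies every filter that is set.
fn matches(filter: &InstanceFilter, c: &Candidate<'_>) -> bool {
    // Unset filters (None) match everything.
    // Status: any of the listed values may match (OR).
    let status_ok = filter
        .status
        .as_ref()
        .map_or(true, |wanted| wanted.iter().any(|s| s.as_str() == c.status));
    let prefix_ok = filter
        .orchestration_name_prefix
        .as_ref()
        .map_or(true, |p| c.orchestration_name.starts_with(p.as_str()));
    // Assumption: time bounds are exclusive ("after" / "before").
    let after_ok = filter.created_after.map_or(true, |t| c.created_at > t);
    let before_ok = filter.created_before.map_or(true, |t| c.created_at < t);
    // Tags: the candidate must carry every requested tag (AND).
    let tags_ok = filter.tags.as_ref().map_or(true, |wanted| {
        wanted
            .iter()
            .all(|t| c.tags.iter().any(|have| *have == t.as_str()))
    });
    status_ok && prefix_ok && after_ok && before_ok && tags_ok
}
```

An empty filter matches every instance, which is worth keeping in mind for destructive callers such as bulk deletion.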

### 3.2 Pagination with Cursors

```rust
/// Cursor-based pagination for scalable listing.
#[derive(Debug, Clone)]
pub struct PaginationOptions {
    /// Maximum number of results to return.
    /// Default: 100, Max: 1000
    pub limit: u32,
    
    /// Opaque cursor from previous page's `next_cursor`.
    /// None = start from beginning.
    pub cursor: Option<String>,
    
    /// Sort order.
    pub order: SortOrder,
}

#[derive(Debug, Clone, Copy, Default)]
pub enum SortOrder {
    /// Newest first (by created_at DESC)
    #[default]
    CreatedDesc,
    /// Oldest first (by created_at ASC)
    CreatedAsc,
    /// Most recently updated first
    UpdatedDesc,
    /// Least recently updated first
    UpdatedAsc,
}

impl Default for PaginationOptions {
    fn default() -> Self {
        Self {
            limit: 100,
            cursor: None,
            order: SortOrder::default(),
        }
    }
}

/// Paginated result set.
#[derive(Debug, Clone)]
pub struct PaginatedResult<T> {
    /// Items in this page.
    pub items: Vec<T>,
    
    /// Cursor for fetching next page.
    /// None if this is the last page.
    pub next_cursor: Option<String>,
    
    /// Total count matching filter (if available).
    /// May be None if counting is too expensive.
    pub total_count: Option<u64>,
    
    /// Whether there are more results.
    pub has_more: bool,
}
```

### 3.3 Provider Trait Additions

```rust
#[async_trait::async_trait]
pub trait ProviderAdmin: Any + Send + Sync {
    // ... existing methods ...

    /// List instances with filtering and pagination.
    ///
    /// # Parameters
    ///
    /// * `filter` - Criteria to filter instances
    /// * `pagination` - Pagination options (limit, cursor, order)
    ///
    /// # Returns
    ///
    /// Paginated result with instances and next cursor.
    ///
    /// # Implementation Notes
    ///
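    /// The cursor should encode the sort key plus `instance_id` as a tiebreaker,
    /// for example (created_at, instance_id) for the `Created*` orders, so that
    /// pagination stays stable even as new instances are created.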
    ///
    /// # Example SQL (CreatedDesc order)
    ///
    /// ```sql
    /// SELECT i.*, e.status, e.output
    /// FROM instances i
    /// LEFT JOIN executions e ON i.instance_id = e.instance_id 
    ///     AND i.current_execution_id = e.execution_id
    /// WHERE 
    ///     (? IS NULL OR e.status IN (?))
    ///     AND (? IS NULL OR i.orchestration_name = ?)
    ///     AND (? IS NULL OR i.orchestration_name LIKE ? || '%')
    ///     AND (? IS NULL OR i.created_at > ?)
    ///     AND (? IS NULL OR i.created_at < ?)
    ///     AND (? IS NULL OR (i.created_at, i.instance_id) < (?, ?))  -- cursor
    /// ORDER BY i.created_at DESC, i.instance_id DESC
    /// LIMIT ?
    /// ```
    async fn list_instances_paginated(
        &self,
        filter: InstanceFilter,
        pagination: PaginationOptions,
    ) -> Result<PaginatedResult<InstanceInfo>, ProviderError>;
    
    /// Count instances matching filter.
    ///
    /// Useful for UI pagination (showing "Page X of Y").
    /// May be expensive on large datasets.
    ///
    /// # Returns
    ///
    /// Total count of instances matching filter.
    async fn count_instances(
        &self,
        filter: InstanceFilter,
    ) -> Result<u64, ProviderError>;
}
```
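The cursor condition in the example SQL is a row-value comparison, which maps directly onto tuple comparison in Rust. The following in-memory sketch shows keyset pagination for the `CreatedDesc` order, using the fetch-one-extra technique to detect a further page. `Row`, `Page`, and `page_after` are hypothetical names, and the cursor is kept as a plain tuple here rather than an opaque string.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Row {
    created_at: u64,
    instance_id: String,
}

struct Page {
    items: Vec<Row>,
    next_cursor: Option<(u64, String)>,
    has_more: bool,
}

/// `rows` must already be sorted by (created_at DESC, instance_id DESC).
fn page_after(rows: &[Row], cursor: Option<&(u64, String)>, limit: usize) -> Page {
    let limit = limit.max(1); // a page always holds at least one row
    let mut items: Vec<Row> = rows
        .iter()
        .filter(|r| match cursor {
            // Same as SQL: (created_at, instance_id) < (cursor_ts, cursor_id)
            Some((ts, id)) => (r.created_at, r.instance_id.as_str()) < (*ts, id.as_str()),
            None => true,
        })
        // Fetch one extra row to learn whether another page exists.
        .take(limit.saturating_add(1))
        .cloned()
        .collect();
    let has_more = items.len() > limit;
    items.truncate(limit);
    let next_cursor = if has_more {
        items.last().map(|r| (r.created_at, r.instance_id.clone()))
    } else {
        None
    };
    Page { items, next_cursor, has_more }
}
```

Because `instance_id` breaks ties, two instances with the same `created_at` are never skipped or repeated across a page boundary.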

### 3.4 Client API

```rust
impl Client {
    /// List instances with filtering and pagination.
    ///
    /// # Example
    ///
    /// ```ignore
    /// // Get first page of running OrderWorkflow instances
    /// let filter = InstanceFilter::new()
    ///     .with_status("Running")
    ///     .with_orchestration_name("OrderWorkflow");
    /// let pagination = PaginationOptions { limit: 50, ..Default::default() };
    /// 
    /// let page1 = client.list_instances_paginated(filter.clone(), pagination).await?;
    /// println!("Found {} instances", page1.items.len());
    /// 
    /// // Get the next page, keeping the same page size
    /// if let Some(cursor) = page1.next_cursor {
    ///     let page2 = client.list_instances_paginated(
    ///         filter,
    ///         PaginationOptions { limit: 50, cursor: Some(cursor), ..Default::default() }
    ///     ).await?;
    /// }
    /// ```
    pub async fn list_instances_paginated(
        &self,
        filter: InstanceFilter,
        pagination: PaginationOptions,
    ) -> Result<PaginatedResult<InstanceInfo>, ClientError>;
    
    /// Count instances matching filter.
    pub async fn count_instances(
        &self,
        filter: InstanceFilter,
    ) -> Result<u64, ClientError>;
}
```

---

## 4. Performance Optimizations

### 4.1 Database Indexes

Add indexes specifically for management query patterns:

```sql
-- Index for filtering by status (via JOIN with executions)
CREATE INDEX IF NOT EXISTS idx_executions_status 
    ON executions(status, instance_id);

-- Index for filtering by orchestration name
CREATE INDEX IF NOT EXISTS idx_instances_name 
    ON instances(orchestration_name, created_at DESC);

-- Index for time-range queries
CREATE INDEX IF NOT EXISTS idx_instances_created 
    ON instances(created_at DESC, instance_id);

-- Index for updated_at queries (find stale instances)
CREATE INDEX IF NOT EXISTS idx_instances_updated 
    ON instances(updated_at DESC);

-- Covering index for common listing query
CREATE INDEX IF NOT EXISTS idx_instances_list_covering 
    ON instances(created_at DESC, instance_id, orchestration_name, orchestration_version, current_execution_id);

-- Index for execution pruning
CREATE INDEX IF NOT EXISTS idx_history_execution 
    ON history(instance_id, execution_id);
```

### 4.2 Query Optimization Patterns

**Avoid COUNT(*) for pagination:**

```rust
// Instead of: SELECT COUNT(*) FROM instances WHERE ...
// Use: fetch limit+1 and check if we got more

async fn list_instances_paginated(...) -> Result<PaginatedResult<InstanceInfo>, ProviderError> {
    // Fetch one extra to determine if there are more pages
    let fetch_limit = pagination.limit + 1;
    
    let rows = sqlx::query(/* ... LIMIT ? */)
        .bind(fetch_limit)
        .fetch_all(&self.pool)
        .await?;
    
    let has_more = rows.len() > pagination.limit as usize;
    let items: Vec<InstanceInfo> = rows
        .into_iter()
        .take(pagination.limit as usize)
        .map(|r| /* convert */)
        .collect();
    
    let next_cursor = if has_more {
        // encode_cursor takes the timestamp by value (u64), not by reference
        items.last().map(|i| encode_cursor(i.created_at, &i.instance_id))
    } else {
        None
    };
    
    Ok(PaginatedResult { items, next_cursor, total_count: None, has_more })
}
```

**Cursor encoding for stable pagination:**

```rust
// base64 0.21+ replaces the free `encode`/`decode` functions with the `Engine` trait
use base64::{engine::general_purpose::STANDARD, Engine as _};

fn encode_cursor(created_at: u64, instance_id: &str) -> String {
    // Base64 keeps the cursor opaque to clients; it is not a security boundary
    let raw = format!("{}:{}", created_at, instance_id);
    STANDARD.encode(raw)
}

fn decode_cursor(cursor: &str) -> Result<(u64, String), ProviderError> {
    let raw = STANDARD.decode(cursor)
        .map_err(|_| ProviderError::permanent("decode_cursor", "Invalid cursor"))?;
    let s = String::from_utf8(raw)
        .map_err(|_| ProviderError::permanent("decode_cursor", "Invalid cursor encoding"))?;
    // Split on the first ':' only, so instance IDs may themselves contain ':'
    let (timestamp, instance_id) = s.split_once(':')
        .ok_or_else(|| ProviderError::permanent("decode_cursor", "Malformed cursor"))?;
    let created_at = timestamp.parse::<u64>()
        .map_err(|_| ProviderError::permanent("decode_cursor", "Invalid timestamp in cursor"))?;
    Ok((created_at, instance_id.to_string()))
}
```
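To check the round-trip property without pulling in a crate, the sketch below swaps base64 for plain hex encoding from the standard library; the framing (`timestamp:instance_id`, split on the first colon) is the same as above. The hex encoding and the `Option` return type are substitutions made for illustration only.

```rust
/// Dependency-free variant: hex instead of base64 (an illustrative swap).
fn encode_cursor(created_at: u64, instance_id: &str) -> String {
    format!("{created_at}:{instance_id}")
        .bytes()
        .map(|b| format!("{b:02x}"))
        .collect()
}

fn decode_cursor(cursor: &str) -> Option<(u64, String)> {
    // Reject anything that is not an even-length string of hex digits.
    if cursor.len() % 2 != 0 || !cursor.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // Decode two hex digits at a time; slicing is safe because input is ASCII.
    let bytes: Option<Vec<u8>> = (0..cursor.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&cursor[i..i + 2], 16).ok())
        .collect();
    let raw = String::from_utf8(bytes?).ok()?;
    // Split on the first ':' only, so instance IDs may contain ':'.
    let (timestamp, instance_id) = raw.split_once(':')?;
    Some((timestamp.parse().ok()?, instance_id.to_string()))
}
```

Splitting on the first colon matters: an instance ID such as `order:42` survives the round trip because the timestamp part never contains a colon.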

### 4.3 Batched Operations

For bulk deletions, use batched transactions to avoid long-running locks:

```rust
async fn delete_instances_bulk(
    &self,
    filter: InstanceFilter,
) -> Result<TruncateResult, ProviderError> {
    const BATCH_SIZE: u32 = 100;
    let mut total_result = TruncateResult::default();
    
    loop {
        // Find next batch of instance IDs to delete
        let pagination = PaginationOptions { limit: BATCH_SIZE, ..Default::default() };
        let page = self.list_instances_paginated(filter.clone(), pagination).await?;
        
        if page.items.is_empty() {
            break;
        }
        
        // Delete this batch in a transaction
        let mut tx = self.pool.begin().await?;
        for info in &page.items {
            let result = self.delete_instance_in_tx(&mut tx, &info.instance_id).await?;
            total_result.instances_deleted += result.instances_deleted;
            total_result.executions_deleted += result.executions_deleted;
            total_result.events_deleted += result.events_deleted;
            total_result.messages_deleted += result.messages_deleted;
        }
        tx.commit().await?;
        
        if !page.has_more {
            break;
        }
    }
    
    Ok(total_result)
}
```
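The loop above always re-queries from the first page rather than following a cursor, which works because deleted rows no longer match the filter. A minimal in-memory sketch of that pattern, with a `BTreeMap` of instance ID to status standing in for the database and a hypothetical `delete_in_batches` helper, looks like this:

```rust
use std::collections::BTreeMap;

/// Deletes every entry whose status equals `status`, at most `batch_size`
/// per pass. Returns (entries deleted, batches committed).
fn delete_in_batches(
    store: &mut BTreeMap<String, String>,
    status: &str,
    batch_size: usize,
) -> (u64, u64) {
    assert!(batch_size > 0, "batch_size must be positive");
    let (mut deleted, mut batches) = (0u64, 0u64);
    loop {
        // Always select from the start: rows deleted in earlier passes are gone,
        // so no cursor is needed between batches.
        let batch: Vec<String> = store
            .iter()
            .filter(|(_, s)| s.as_str() == status)
            .map(|(id, _)| id.clone())
            .take(batch_size)
            .collect();
        if batch.is_empty() {
            break;
        }
        // In a real provider, this inner loop would run inside one transaction.
        for id in &batch {
            store.remove(id);
            deleted += 1;
        }
        batches += 1;
    }
    (deleted, batches)
}
```

One consequence of this design is that the loop only terminates if every selected row is actually removed; a provider that skips some rows (for example, running instances) would need a cursor or an exclusion list to avoid selecting them again.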

---

## 5. Schema Changes

### 5.1 Migration SQL

```sql
-- Migration: Add management API indexes
-- Version: 2024_xx_management_api_indexes

-- Index for status filtering
CREATE INDEX IF NOT EXISTS idx_executions_status 
    ON executions(status, instance_id);

-- Index for orchestration name filtering
CREATE INDEX IF NOT EXISTS idx_instances_name 
    ON instances(orchestration_name, created_at DESC);

-- Index for time-range queries
CREATE INDEX IF NOT EXISTS idx_instances_created 
    ON instances(created_at DESC, instance_id);

-- Index for finding stale instances
CREATE INDEX IF NOT EXISTS idx_instances_updated 
    ON instances(updated_at DESC);

-- Index for execution pruning
CREATE INDEX IF NOT EXISTS idx_history_execution 
    ON history(instance_id, execution_id);


-- Migration: Add optional instance metadata
-- Version: 2024_xx_instance_metadata

-- Add input column to track original input (optional)
ALTER TABLE instances ADD COLUMN input TEXT;

-- Add tags column for categorization (JSON array, optional)
ALTER TABLE instances ADD COLUMN tags TEXT;

-- Add retention policy (JSON, optional)
ALTER TABLE instances ADD COLUMN retention_policy TEXT;
```

### 5.2 Backward Compatibility

All new columns are nullable with no default, ensuring:
- Existing instances continue to work unchanged
- New features are opt-in
- No data migration required for existing deployments

---

## 6. Error Handling

### 6.1 New Error Variants

```rust
#[derive(Debug, Clone)]
pub enum ManagementError {
    /// Instance not found
    InstanceNotFound { instance_id: String },
    
    /// Cannot delete a running instance without force flag
    InstanceStillRunning { instance_id: String },
    
    /// Invalid filter criteria
    InvalidFilter { message: String },
    
    /// Invalid cursor format
    InvalidCursor { cursor: String },
    
    /// Operation would exceed limits
    LimitExceeded { 
        limit_name: String, 
        limit_value: u64, 
        requested: u64 
    },
}
```
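As a sketch of how these variants might surface to callers, the example below adds `Display` and `std::error::Error` implementations and a guard for the page-size maximum of 1000 mentioned in section 3.2. The message wording and the `validate_page_limit` helper are assumptions for illustration; the enum is repeated so the block compiles on its own.

```rust
use std::fmt;

#[derive(Debug, Clone, PartialEq)]
enum ManagementError {
    InstanceNotFound { instance_id: String },
    InstanceStillRunning { instance_id: String },
    InvalidFilter { message: String },
    InvalidCursor { cursor: String },
    LimitExceeded { limit_name: String, limit_value: u64, requested: u64 },
}

impl fmt::Display for ManagementError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::InstanceNotFound { instance_id } => {
                write!(f, "instance '{instance_id}' not found")
            }
            Self::InstanceStillRunning { instance_id } => write!(
                f,
                "instance '{instance_id}' is still running; pass force = true to delete it"
            ),
            Self::InvalidFilter { message } => write!(f, "invalid filter: {message}"),
            Self::InvalidCursor { cursor } => write!(f, "invalid cursor '{cursor}'"),
            Self::LimitExceeded { limit_name, limit_value, requested } => write!(
                f,
                "{limit_name} is capped at {limit_value}, but {requested} was requested"
            ),
        }
    }
}

impl std::error::Error for ManagementError {}

/// Hypothetical guard for the documented page-size maximum of 1000.
fn validate_page_limit(requested: u32) -> Result<u32, ManagementError> {
    const MAX_PAGE_LIMIT: u32 = 1000;
    if requested > MAX_PAGE_LIMIT {
        return Err(ManagementError::LimitExceeded {
            limit_name: "limit".to_string(),
            limit_value: u64::from(MAX_PAGE_LIMIT),
            requested: u64::from(requested),
        });
    }
    Ok(requested)
}
```

Rejecting an oversized limit is only one option; a provider could instead clamp silently to the maximum, in which case `LimitExceeded` would be reserved for other bounds.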

---

## 7. Implementation Phases

### Phase 1: Core APIs (MVP)
- [ ] `list_instances_with_info()` — enriched listing
- [ ] `delete_instance()` — single instance deletion
- [ ] `prune_executions()` — execution history pruning
- [ ] Basic filtering by status and orchestration name

### Phase 2: Pagination & Performance
- [ ] `list_instances_paginated()` with cursor-based pagination
- [ ] `count_instances()` for total counts
- [ ] Database indexes for management queries
- [ ] Batched bulk operations

### Phase 3: Advanced Features
- [ ] `delete_instances_bulk()` with filters
- [ ] Automatic retention policies
- [ ] Tag support for instances
- [ ] Time-range filtering

---

## 8. Open Questions

1. **Retention Policy Storage**: Should retention policies be stored in the instance record, or in a separate configuration table?

2. **Automatic vs Manual Pruning**: Should auto-pruning happen during `continue_as_new`, or via a background maintenance job?

3. **Soft vs Hard Delete**: Should we support soft deletes (mark as deleted but retain data) for audit purposes?

4. **Read Replicas**: Should management queries be routable to read replicas for production workloads?

5. **Metrics Integration**: Should truncation operations emit metrics (events deleted, time taken)?