duroxide-cdb 0.1.10

A CosmosDB-based provider implementation for Duroxide, a durable task orchestration framework
# duroxide-cdb AI Agent Instructions

CosmosDB NoSQL provider for [duroxide](https://github.com/microsoft/duroxide). Key docs: `docs/ARCHITECTURE.md`, `SPEC.md`.

## CRITICAL: Git Operations

**NEVER commit or push without explicit user permission.** Always ask before running `git commit` or `git push`.

## CRITICAL: Test Execution

**ALWAYS use nextest for running tests.** This is mandatory.

```bash
cargo nt                                                # Full test pass (cargo alias: nextest run --all-features)
cargo nt -E 'test(/pattern/)'                           # Filter by pattern
cargo nt --test e2e_samples                             # E2E sample tests only
cargo ntw                                               # Full test pass with output (--no-capture)
cargo c                                                 # Clippy (--all-targets --all-features)
cargo f                                                 # Format
```

Aliases are defined in `.cargo/config.toml`. Only fall back to `cargo test` if nextest is not available. Never use `cargo test` when nextest works.

## Architecture Overview

**Single-container CosmosDB design** with `/instanceId` partition key. Six document types coexist:
- `instance` — orchestration metadata + lock state
- `history` — append-only event log
- `orch_queue` — orchestration dispatcher work items
- `worker_queue` — activity worker work items
- `outbox_intent` — pending cross-partition writes
- `session` — session affinity ownership

**Key design patterns:**
- Optimistic concurrency via ETags (no database locks)
- 256-slot dispatch partitioning via `LeaseProvider`
- Transactional outbox for cross-partition writes (sub-orchestrations, completions)
- Raw REST client (no Azure SDK) with HMAC-SHA256 auth

## Module Map

| Module | Purpose |
|--------|---------|
| `provider.rs` | `Provider` + `ProviderAdmin` trait implementations (~2000 lines, main logic) |
| `client.rs` | `CosmosDBClient` — HTTP transport, auth, CRUD, query, batch |
| `models.rs` | Serde structs for all 6 document types |
| `query.rs` | Cross-partition and single-partition query builders |
| `batch.rs` | Transactional batch operations (up to 100 ops per batch) |
| `containers.rs` | Database/container bootstrapping with retry for 429s |
| `outbox.rs` | Outbox intent delivery + background reconciler |
| `leases.rs` | `LeaseProvider` trait + `InMemoryLeaseProvider` (256-slot distribution) |
| `errors.rs` | CosmosDB HTTP status → `ProviderError` mapping |
| `lib.rs` | Public API re-exports |

## Critical Implementation Details

### Cross-Partition Query Limitations (REST Gateway)

CosmosDB's REST gateway does NOT support cross-partition queries with:
- `ORDER BY` — returns 400 with query plan
- Server-side aggregates (`COUNT`, `MAX`, `SUM`) — same issue
- `TOP N` combined with `ORDER BY`

**How we handle this:**
- Queue candidate queries: no `ORDER BY`, fetch all matching items, sort **client-side** by `enqueuedAt`
- Count queries: `SELECT c.id` + client-side `.len()` instead of `SELECT VALUE COUNT(1)`
- Partition-scoped queries (history, messages): CAN use `ORDER BY` safely
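
A minimal sketch of the client-side replacements described above. `QueueCandidate`, `sort_candidates`, and `count_ids` are hypothetical names for illustration, not the crate's actual types, and `enqueuedAt` is assumed to be an epoch-millisecond integer:

```rust
/// Hypothetical, trimmed-down queue document; the real structs live in `models.rs`.
#[derive(Debug, Clone, PartialEq)]
pub struct QueueCandidate {
    pub id: String,
    /// Stands in for `enqueuedAt`; assumed to be epoch milliseconds.
    pub enqueued_at: u64,
}

/// Orders candidates oldest-first on the client, since the cross-partition
/// query itself cannot carry an `ORDER BY`.
pub fn sort_candidates(mut items: Vec<QueueCandidate>) -> Vec<QueueCandidate> {
    // The tie-break on `id` is an addition for determinism, not taken from the source.
    items.sort_by(|a, b| {
        a.enqueued_at
            .cmp(&b.enqueued_at)
            .then_with(|| a.id.cmp(&b.id))
    });
    items
}

/// Counts the rows of a `SELECT c.id` result instead of asking the server for `COUNT(1)`.
pub fn count_ids(ids: &[String]) -> usize {
    ids.len()
}
```

Both helpers load every matching row into memory, so they suit queue-sized result sets rather than unbounded scans.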

### Session Piggyback Timestamps

`ack_work_item` and `renew_work_item_lock` piggyback-update session `lastActivity`. They must use a **fresh `now_ms()`** at the point of update, NOT the timestamp from the start of the function. Network latency makes a timestamp captured at function entry stale by the time it is written, which causes idle-window checks to fail.
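
The rule can be illustrated with a small sketch. Everything here is an assumption for illustration: the body of `now_ms`, the `Session` struct, and `ack_with_piggyback`, which takes the clock and the network work as closures so the timing is visible:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Assumed shape of the `now_ms()` helper: wall-clock milliseconds since the Unix epoch.
pub fn now_ms() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as u64)
        .unwrap_or(0)
}

/// Hypothetical stand-in for the `session` document.
pub struct Session {
    pub last_activity: u64,
}

/// Hypothetical ack flow. `clock` stands in for `now_ms`, and
/// `network_round_trip` for the requests made before the session update.
pub fn ack_with_piggyback(
    session: &mut Session,
    mut clock: impl FnMut() -> u64,
    mut network_round_trip: impl FnMut(),
) {
    // Captured at function entry; already stale once the requests below finish.
    let _started_at = clock();
    network_round_trip();
    // Read the clock again right before the write, never reuse `_started_at`.
    session.last_activity = clock();
}
```

In production code the call would look like `ack_with_piggyback(&mut session, now_ms, || { /* requests */ })`.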

### Transactional Batch Limits

CosmosDB allows at most 100 operations per transactional batch. When `ack_orchestration_item` exceeds that limit, it splits the work into sequential batches. The first batch includes the instance upsert (which releases the lock). Subsequent batches are best-effort, so atomicity holds only within each individual batch.
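
A sketch of the splitting step, under stated assumptions: `BatchOp` and `plan_batches` are hypothetical names, and placing the instance upsert at the very front of the first batch is one simple way to guarantee it lands there, not necessarily what `provider.rs` does:

```rust
/// Per-batch operation limit stated above.
pub const MAX_BATCH_OPS: usize = 100;

/// Hypothetical operation type; the real batch operations live in `batch.rs`.
#[derive(Debug, Clone, PartialEq)]
pub enum BatchOp {
    UpsertInstance,
    Other(u32),
}

/// Splits the operations into batches of at most `MAX_BATCH_OPS`,
/// with the instance upsert leading the first batch.
pub fn plan_batches(instance_upsert: BatchOp, rest: Vec<BatchOp>) -> Vec<Vec<BatchOp>> {
    let mut all = Vec::with_capacity(rest.len() + 1);
    all.push(instance_upsert);
    all.extend(rest);
    // `chunks` yields full slices of 100 followed by one shorter remainder.
    all.chunks(MAX_BATCH_OPS).map(|chunk| chunk.to_vec()).collect()
}
```

With 250 follow-up operations this produces three batches of 100, 100, and 51 operations.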

### Cancelled Activity Deletes Are Best-Effort

When an orchestration cancels an in-flight activity (e.g., `select2` picks a timer over an activity), the cancelled activity's `worker_queue` document must NOT be deleted inside the transactional batch. The worker dispatcher may have already consumed and deleted it, causing a 404 inside the batch which fails the entire transaction with 424. Instead, cancelled activity deletes are performed best-effort after the batch commits, silently ignoring 404s.
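
A sketch of the post-commit cleanup, assuming a hypothetical `delete` callback that issues one DELETE per document and returns the HTTP status. Collecting other failures for logging is an illustrative choice, not taken from the source:

```rust
/// Hypothetical best-effort cleanup, run only after the transactional batch has committed.
/// Returns the ids whose delete failed with something other than 404.
pub fn delete_cancelled_best_effort(
    ids: &[&str],
    mut delete: impl FnMut(&str) -> u16,
) -> Vec<(String, u16)> {
    let mut unexpected = Vec::new();
    for &id in ids {
        match delete(id) {
            // Any 2xx means the document was deleted.
            200..=299 => {}
            // The worker dispatcher already consumed and deleted it; nothing to do.
            404 => {}
            // Still best-effort: record the failure instead of propagating it.
            status => unexpected.push((id.to_string(), status)),
        }
    }
    unexpected
}
```

Because these deletes run outside the batch, a 404 here cannot trigger the 424 that would roll back the whole transaction.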

### Dispatch Slot Partitioning

Every queue item has a precomputed `dispatchSlot = hash(instanceId) % 256`. Dispatchers only query their assigned slots via `AND c.dispatchSlot IN (...)`. When all 256 slots are assigned (single dispatcher), skip the `IN` clause entirely.
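
A sketch of slot computation and filter building. The source does not say which hash function is used, so FNV-1a appears here purely as an assumption; `dispatch_slot` and `slot_filter` are hypothetical names:

```rust
/// Number of dispatch slots stated above.
pub const SLOT_COUNT: u64 = 256;

/// Maps an instance id to a slot in `0..256`.
/// FNV-1a (64-bit) is a stand-in; the crate's real hash may differ.
pub fn dispatch_slot(instance_id: &str) -> u16 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    for byte in instance_id.as_bytes() {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV-1a prime
    }
    (hash % SLOT_COUNT) as u16
}

/// Builds the extra query predicate for a dispatcher's slots.
/// Returns `None` when the dispatcher owns every slot, so the caller skips the `IN` clause.
/// Assumes `assigned` is non-empty and free of duplicates.
pub fn slot_filter(assigned: &[u16]) -> Option<String> {
    if assigned.len() as u64 >= SLOT_COUNT {
        return None;
    }
    let list = assigned
        .iter()
        .map(|slot| slot.to_string())
        .collect::<Vec<_>>()
        .join(", ");
    Some(format!(" AND c.dispatchSlot IN ({})", list))
}
```

Whatever hash is chosen must be stable across processes and releases, because the slot is stored on the document and later matched by dispatchers that may run different builds.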

### Optimistic Locking Pattern

All lock operations follow: read → get `_etag` → modify → conditional replace with `If-Match` header → handle 412/409.
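
The sequence can be sketched against an in-memory stand-in. `FakeStore`, `Doc`, and `try_lock` are hypothetical, the numeric `etag` simplifies CosmosDB's opaque string ETags, and only the 412 path is modeled:

```rust
/// Hypothetical instance document carrying its version and lock owner.
#[derive(Debug, Clone, PartialEq)]
pub struct Doc {
    pub etag: u64,
    pub locked_by: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum ReplaceError {
    /// Models HTTP 412: the stored ETag no longer matches `If-Match`.
    PreconditionFailed,
}

/// In-memory stand-in for a single document in the container.
pub struct FakeStore {
    doc: Doc,
}

impl FakeStore {
    pub fn new() -> Self {
        FakeStore { doc: Doc { etag: 0, locked_by: None } }
    }

    pub fn read(&self) -> Doc {
        self.doc.clone()
    }

    /// Conditional replace: succeeds only if `if_match` equals the stored ETag.
    pub fn replace_if_match(&mut self, if_match: u64, mut new_doc: Doc) -> Result<Doc, ReplaceError> {
        if self.doc.etag != if_match {
            return Err(ReplaceError::PreconditionFailed);
        }
        new_doc.etag = self.doc.etag + 1; // every successful write bumps the version
        self.doc = new_doc.clone();
        Ok(new_doc)
    }
}

/// read -> take the ETag -> modify -> conditional replace.
/// `Ok(false)` means someone else already holds the lock.
pub fn try_lock(store: &mut FakeStore, owner: &str) -> Result<bool, ReplaceError> {
    let current = store.read();
    if current.locked_by.is_some() {
        return Ok(false);
    }
    let mut updated = current.clone();
    updated.locked_by = Some(owner.to_string());
    store.replace_if_match(current.etag, updated).map(|_| true)
}
```

A caller that receives `PreconditionFailed` lost a race with another writer and would typically re-read the document before deciding whether to retry.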

## Error Handling

- 409 Conflict → `retryable` (ETag race or duplicate)
- 412 Precondition Failed → `retryable` (ETag mismatch)
- 429 Too Many Requests → `retryable` (rate limited, backoff)
- 404 Not Found → `permanent`
- 400 Bad Request → `permanent` (likely a cross-partition query issue)
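
The mapping above, written out as a sketch. `ErrorClass` and `classify_status` are hypothetical names; the real conversion into `ProviderError` lives in `errors.rs` and may cover more status codes:

```rust
/// Hypothetical two-way classification mirroring the list above.
#[derive(Debug, PartialEq)]
pub enum ErrorClass {
    Retryable,
    Permanent,
}

/// Classifies the status codes listed above; anything else is left to the caller.
pub fn classify_status(status: u16) -> Option<ErrorClass> {
    match status {
        // ETag race or duplicate, ETag mismatch, rate limiting.
        409 | 412 | 429 => Some(ErrorClass::Retryable),
        // Bad request (often a cross-partition query issue) or missing document.
        400 | 404 => Some(ErrorClass::Permanent),
        // Not covered by the list in this section.
        _ => None,
    }
}
```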

## Configuration

Environment variables (`.env` file, loaded via `dotenvy`):
- `COSMOS_ENDPOINT` — CosmosDB endpoint URL
- `COSMOS_KEY` — CosmosDB master key
- `COSMOS_DATABASE` — Database name (default: `duroxide`)

Programmatic config via `CosmosDBProviderConfig`:
- `orch_concurrency` / `worker_concurrency` — number of dispatchers (controls slot partitioning)
- `reconciler_interval` / `reconciler_age_threshold` — outbox reconciler timing (default: 2s each)

## Testing

### Test Infrastructure

- Tests use **per-test containers** (unique UUID suffix) for isolation
- Test concurrency limited to 4 threads (`.config/nextest.toml`) to avoid metadata 429s
- `ensure_infrastructure` has retry with exponential backoff for container creation
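
A sketch of retry with exponential backoff on 429s. The helper names, the 100 ms base delay, and the 5 s cap are all assumptions for illustration; the values used by `ensure_infrastructure` are not stated here:

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `cap`.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(cap, |delay| delay.min(cap))
}

/// Calls `op` until it returns something other than 429 or `max_attempts` is reached.
/// `op` always runs at least once. `sleep` is injected so tests need not actually wait.
pub fn retry_on_429(
    max_attempts: u32,
    mut op: impl FnMut() -> u16,
    mut sleep: impl FnMut(Duration),
) -> u16 {
    let base = Duration::from_millis(100); // assumed base delay
    let cap = Duration::from_secs(5); // assumed upper bound
    let mut status = op();
    let mut attempt = 0;
    while status == 429 && attempt + 1 < max_attempts {
        sleep(backoff_delay(attempt, base, cap));
        attempt += 1;
        status = op();
    }
    status
}
```

In real code `sleep` could be `std::thread::sleep`, or the loop could be rewritten around an async timer.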

### Provider Validation Tests

`tests/cosmosdb_provider_test.rs` implements `ProviderFactory` and runs ~196 duroxide validation tests. Each test gets its own container.

### E2E Sample Tests

`tests/e2e_samples.rs` ports the full duroxide e2e sample suite. Uses `tests/common/mod.rs` for shared helpers (`create_cosmos_store`, `wait_for_history`, etc.).

### Local Development

```bash
# Start CosmosDB emulator
docker run -p 8081:8081 -p 10250-10255:10250-10255 \
  mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:latest

# Copy env and run tests
cp .env.example .env
cargo nextest run --features provider-test
```

## Build Commands

```bash
cargo nt                                                # Full test pass (nextest + all features)
cargo nt --test e2e_samples                             # E2E samples only
cargo c                                                 # Clippy (all targets, all features)
cargo build --all-targets                               # Build everything
cargo check --features provider-test --tests            # Quick type check
```

## Key Directories

- `src/` — Provider implementation
- `tests/` — Integration tests (provider validation, e2e samples)
- `tests/common/` — Shared test helpers
- `docs/` — Architecture documentation

## When Changing Provider Code

1. Run full test pass: `cargo nt`
2. Run e2e samples: `cargo nt --test e2e_samples`
3. Check for cross-partition query issues (no `ORDER BY`/aggregates in cross-partition queries)
4. Ensure session piggyback updates use fresh `now_ms()`
5. Verify transactional batch stays under 100 operations
6. Run clippy: `cargo c`

## Workflow Rules

- **Never commit or push without explicit user permission.**
- **Always use nextest** for running tests (`cargo nt`).
- **Test against both emulator and Azure CosmosDB** — the emulator is more lenient with cross-partition queries.

## Updating duroxide Dependency

When a new version of `duroxide` is published to crates.io:

1. **Review changes**: Read the duroxide CHANGELOG, README, and `docs/provider-implementation-guide.md` at the duroxide repo (drox root: `duroxide/`)
2. **Update Cargo.toml**: Bump `duroxide` version, run `cargo check`, fix any compilation errors
3. **Implement API changes**: Update `src/provider.rs` for any `Provider`/`ProviderAdmin` trait changes
4. **Check cross-partition query impact**: If new provider methods involve multi-instance queries, ensure no `ORDER BY` or server-side aggregates in cross-partition queries
5. **Add validation tests**: New provider validation tests from duroxide will automatically run via the `ProviderFactory` pattern in `tests/cosmosdb_provider_test.rs`
6. **Test thoroughly**: `cargo nt` (full test pass), `cargo nt --test e2e_samples` (e2e samples)
7. **Update docs**: CHANGELOG.md, README.md, bump version in Cargo.toml
8. **Publish**: `cargo publish` (only with explicit user permission)

> ⚠️ **Never push to remote or publish to crates.io without explicit user confirmation.**
>
> ⚠️ **Keep consistent with PG providers** — all providers must implement the same `Provider`/`ProviderAdmin` traits. Check `providers/duroxide-pg/` and `providers/duroxide-pg-opt/` for reference on how they handle the same changes.

## Self-Updating These Instructions

When you discover important implementation details, design decisions, or gotchas that would help future sessions, **update this file** (`.github/copilot-instructions.md`) and/or `docs/ARCHITECTURE.md`. Examples:
- New cross-partition query limitations encountered
- Semantic differences between CosmosDB and Postgres that affect correctness
- New critical invariants or race conditions identified
- Changes to build/test commands or aliases

**Always notify the user** when you modify these instructions — never silently update them.