# duroxide-cdb AI Agent Instructions
CosmosDB NoSQL provider for [duroxide](https://github.com/microsoft/duroxide). Key docs: `docs/ARCHITECTURE.md`, `SPEC.md`.
## CRITICAL: Git Operations
**NEVER commit or push without explicit user permission.** Always ask before running `git commit` or `git push`.
## CRITICAL: Test Execution
**ALWAYS use nextest for running tests.** This is mandatory.
```bash
cargo nt # Full test pass (cargo alias: nextest run --all-features)
cargo nt -E 'test(/pattern/)' # Filter by pattern
cargo nt --test e2e_samples # E2E sample tests only
cargo ntw # Full test pass with output (--no-capture)
cargo c # Clippy (--all-targets --all-features)
cargo f # Format
```
Aliases are defined in `.cargo/config.toml`. Only fall back to `cargo test` if nextest is not available. Never use `cargo test` when nextest works.
## Architecture Overview
**Single-container CosmosDB design** with `/instanceId` partition key. Six document types coexist:
- `instance` — orchestration metadata + lock state
- `history` — append-only event log
- `orch_queue` — orchestration dispatcher work items
- `worker_queue` — activity worker work items
- `outbox_intent` — pending cross-partition writes
- `session` — session affinity ownership
**Key design patterns:**
- Optimistic concurrency via ETags (no database locks)
- 256-slot dispatch partitioning via `LeaseProvider`
- Transactional outbox for cross-partition writes (sub-orchestrations, completions)
- Raw REST client (no Azure SDK) with HMAC-SHA256 auth
## Module Map
| File | Purpose |
|------|---------|
| `provider.rs` | `Provider` + `ProviderAdmin` trait implementations (~2000 lines, main logic) |
| `client.rs` | `CosmosDBClient` — HTTP transport, auth, CRUD, query, batch |
| `models.rs` | Serde structs for all 6 document types |
| `query.rs` | Cross-partition and single-partition query builders |
| `batch.rs` | Transactional batch operations (up to 100 ops per batch) |
| `containers.rs` | Database/container bootstrapping with retry for 429s |
| `outbox.rs` | Outbox intent delivery + background reconciler |
| `leases.rs` | `LeaseProvider` trait + `InMemoryLeaseProvider` (256-slot distribution) |
| `errors.rs` | CosmosDB HTTP status → `ProviderError` mapping |
| `lib.rs` | Public API re-exports |
## Critical Implementation Details
### Cross-Partition Query Limitations (REST Gateway)
CosmosDB's REST gateway does NOT support cross-partition queries with:
- `ORDER BY` — returns 400 with query plan
- Server-side aggregates (`COUNT`, `MAX`, `SUM`) — same issue
- `TOP N` combined with `ORDER BY`
**How we handle this:**
- Queue candidate queries: no `ORDER BY`, fetch all matching items, sort **client-side** by `enqueuedAt`
- Count queries: `SELECT c.id` + client-side `.len()` instead of `SELECT VALUE COUNT(1)`
- Partition-scoped queries (history, messages): CAN use `ORDER BY` safely
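The client-side workarounds above can be sketched as follows. This is a minimal illustration, not the code in `provider.rs`: the `QueueCandidate` struct, its field types, and both function names are assumptions made for the example.

```rust
/// Hypothetical, trimmed-down view of a queue document.
/// The real serde structs live in `models.rs`.
#[derive(Debug, Clone, PartialEq)]
pub struct QueueCandidate {
    pub id: String,
    /// Assumed representation of `enqueuedAt`: milliseconds since the Unix epoch.
    pub enqueued_at: u64,
}

/// Sort candidates oldest-first on the client, because the cross-partition
/// query is issued without `ORDER BY`.
pub fn sort_candidates(mut items: Vec<QueueCandidate>) -> Vec<QueueCandidate> {
    // Tie-break on `id` so the order is deterministic when timestamps collide.
    items.sort_by(|a, b| {
        a.enqueued_at
            .cmp(&b.enqueued_at)
            .then_with(|| a.id.cmp(&b.id))
    });
    items
}

/// Count matches from a `SELECT c.id` result set instead of asking the
/// server for `SELECT VALUE COUNT(1)`.
pub fn count_ids(ids: &[String]) -> usize {
    ids.len()
}
```

The trade-off is that every matching item crosses the wire before sorting or counting, so these queries should keep their `WHERE` clauses as selective as possible.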
### Session Piggyback Timestamps
`ack_work_item` and `renew_work_item_lock` piggyback-update session `lastActivity`. They must use a **fresh `now_ms()`** at the point of update, NOT the timestamp from the start of the function. Network latency makes early timestamps stale, causing idle window checks to fail.
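A minimal sketch of the rule, assuming a hypothetical `Session` struct and an injected clock in place of the crate's `now_ms()` (injection is used here only to make the behavior easy to test):

```rust
/// Hypothetical in-memory stand-in for the `session` document.
pub struct Session {
    pub last_activity: u64,
}

/// Perform the ack, then stamp `lastActivity` with a timestamp taken
/// AFTER the slow work, not the one captured on entry.
/// `clock` plays the role of `now_ms()`; `do_ack` stands in for the
/// network round trips made while acknowledging the work item.
pub fn ack_with_piggyback(
    session: &mut Session,
    clock: impl Fn() -> u64,
    do_ack: impl FnOnce(),
) {
    // Fine for logging or timeouts, but must not be written to the session.
    let _started_at = clock();

    do_ack();

    // Fresh read at the point of update.
    session.last_activity = clock();
}
```

If `_started_at` were written instead, `lastActivity` would lag by the full duration of the ack, which is the staleness the idle window check trips over.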
### Transactional Batch Limits
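CosmosDB allows at most 100 operations per transactional batch. `ack_orchestration_item` splits the work into sequential batches if that limit is exceeded. The first batch includes the instance upsert (which releases the lock). Each batch commits independently, so subsequent batches are best-effort and are not atomic with the first.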
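The splitting rule can be sketched like this. The function name and the generic operation type are assumptions for illustration; the real batching lives in `batch.rs` and `provider.rs`.

```rust
/// Maximum operations per transactional batch, as stated above.
pub const MAX_BATCH_OPS: usize = 100;

/// Plan sequential batches. The instance upsert is placed first so it is
/// guaranteed to land in the first batch; the remaining operations fill
/// that batch and then spill into follow-up batches.
pub fn plan_batches<T: Clone>(instance_upsert: T, rest: &[T]) -> Vec<Vec<T>> {
    let mut all = Vec::with_capacity(rest.len() + 1);
    all.push(instance_upsert);
    all.extend_from_slice(rest);

    // `chunks` yields slices of at most MAX_BATCH_OPS items, in order.
    all.chunks(MAX_BATCH_OPS).map(|chunk| chunk.to_vec()).collect()
}
```

With 250 follow-up operations this yields three batches of 100, 100, and 51 operations, and only the first carries the lock release.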
### Cancelled Activity Deletes Are Best-Effort
When an orchestration cancels an in-flight activity (e.g., `select2` picks a timer over an activity), the cancelled activity's `worker_queue` document must NOT be deleted inside the transactional batch. The worker dispatcher may have already consumed and deleted it, causing a 404 inside the batch which fails the entire transaction with 424. Instead, cancelled activity deletes are performed best-effort after the batch commits, silently ignoring 404s.
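A sketch of the post-commit cleanup, assuming a hypothetical `delete` callback that returns the HTTP status of a single-document delete. Returning the non-404 failures to the caller (for logging) is an assumption of this example, not documented behavior.

```rust
/// Delete cancelled `worker_queue` documents one by one AFTER the batch
/// has committed. A 404 means the worker dispatcher already consumed the
/// item, which is the expected race and is ignored.
pub fn delete_cancelled_best_effort<F>(ids: &[&str], mut delete: F) -> Vec<(String, u16)>
where
    F: FnMut(&str) -> u16,
{
    let mut failures = Vec::new();
    for &id in ids {
        match delete(id) {
            // Success, or already gone: nothing to do.
            200..=299 | 404 => {}
            // Anything else is collected but does not abort the loop.
            other => failures.push((id.to_string(), other)),
        }
    }
    failures
}
```

Because each delete is its own request, one missing document can no longer fail the others the way a 404 inside a batch would.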
### Dispatch Slot Partitioning
Every queue item has a precomputed `dispatchSlot = hash(instanceId) % 256`. Dispatchers only query their assigned slots via `AND c.dispatchSlot IN (...)`. When all 256 slots are assigned (single dispatcher), skip the `IN` clause entirely.
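The slot computation and query filter might look like the sketch below. The hash function is an assumption: FNV-1a is used here only because it is stable across processes, and the source does not say which hash `LeaseProvider` actually uses. The function names are also illustrative.

```rust
pub const SLOT_COUNT: u32 = 256;

/// 32-bit FNV-1a. Illustrative choice only; the real hash may differ.
fn fnv1a(bytes: &[u8]) -> u32 {
    let mut h: u32 = 0x811c_9dc5;
    for b in bytes {
        h ^= *b as u32;
        h = h.wrapping_mul(0x0100_0193);
    }
    h
}

/// `dispatchSlot = hash(instanceId) % 256`
pub fn dispatch_slot(instance_id: &str) -> u32 {
    fnv1a(instance_id.as_bytes()) % SLOT_COUNT
}

/// Build the extra predicate for a dispatcher's assigned slots.
/// Returns an empty string when every slot is assigned, so the `IN`
/// clause is skipped entirely. Assumes `slots` is non-empty and
/// contains no duplicates.
pub fn slot_filter(slots: &[u32]) -> String {
    if slots.len() as u32 >= SLOT_COUNT {
        return String::new();
    }
    let list = slots
        .iter()
        .map(|s| s.to_string())
        .collect::<Vec<_>>()
        .join(", ");
    format!(" AND c.dispatchSlot IN ({})", list)
}
```

Whatever hash is used, it has to be deterministic across processes and releases, since the slot is stored on the document at enqueue time and matched by dispatchers later.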
### Optimistic Locking Pattern
All lock operations follow: read → get `_etag` → modify → conditional replace with `If-Match` header → handle 412/409.
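The read, modify, conditional-replace loop can be illustrated with an in-memory stand-in. Everything here is hypothetical: `FakeStore`, `Doc`, and the numeric ETag are simplifications (real `_etag` values are opaque strings sent in the `If-Match` header), and the retry policy is an assumption.

```rust
#[derive(Debug, Clone)]
pub struct Doc {
    /// Simplified ETag; CosmosDB uses an opaque string.
    pub etag: u64,
    pub locked_by: Option<String>,
}

/// In-memory stand-in for a container holding a single document.
pub struct FakeStore {
    pub doc: Doc,
}

impl FakeStore {
    pub fn read(&self) -> Doc {
        self.doc.clone()
    }

    /// Conditional replace: 200 on success, 412 when `if_match` is stale.
    pub fn replace_if_match(&mut self, new_doc: Doc, if_match: u64) -> u16 {
        if self.doc.etag != if_match {
            return 412;
        }
        // The server assigns a new ETag on every successful write.
        self.doc = Doc { etag: if_match + 1, ..new_doc };
        200
    }
}

/// read -> take the ETag -> modify -> conditional replace -> handle 412/409.
pub fn try_lock(store: &mut FakeStore, owner: &str, max_attempts: u32) -> bool {
    for _ in 0..max_attempts {
        let current = store.read();
        if current.locked_by.is_some() {
            return false; // someone else holds the lock
        }
        let mut updated = current.clone();
        updated.locked_by = Some(owner.to_string());
        match store.replace_if_match(updated, current.etag) {
            200 => return true,
            409 | 412 => continue, // lost the race: re-read and retry
            _ => return false,
        }
    }
    false
}
```

The important property is that the write carries the ETag from the read it was based on, so a concurrent writer turns the replace into a 412 rather than a lost update.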
## Error Handling
- 409 Conflict → `retryable` (ETag race or duplicate)
- 412 Precondition Failed → `retryable` (ETag mismatch)
- 429 Too Many Requests → `retryable` (rate limited, backoff)
- 404 Not Found → `permanent`
- 400 Bad Request → `permanent` (likely a cross-partition query issue)
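As a compact restatement of the list above, the mapping could be written as a single match. `ErrorClass` and `classify_status` are illustrative names, not the actual API in `errors.rs`; statuses not covered by the list return `None` rather than a guess.

```rust
#[derive(Debug, PartialEq)]
pub enum ErrorClass {
    Retryable,
    Permanent,
}

/// Classify a CosmosDB HTTP status according to the list above.
pub fn classify_status(status: u16) -> Option<ErrorClass> {
    match status {
        // ETag races, duplicates, and rate limiting: safe to retry.
        409 | 412 | 429 => Some(ErrorClass::Retryable),
        // Missing document or malformed request: retrying will not help.
        400 | 404 => Some(ErrorClass::Permanent),
        // Not covered by this document.
        _ => None,
    }
}
```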
## Configuration
Environment variables (`.env` file, loaded via `dotenvy`):
- `COSMOS_ENDPOINT` — CosmosDB endpoint URL
- `COSMOS_KEY` — CosmosDB master key
- `COSMOS_DATABASE` — Database name (default: `duroxide`)
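A sketch of reading these variables with the documented default. `CosmosEnv` and `load_env` are hypothetical names, and treating `COSMOS_ENDPOINT` and `COSMOS_KEY` as required is an assumption of this example. The lookup is passed in as a closure so the logic can be exercised without touching the process environment.

```rust
#[derive(Debug, PartialEq)]
pub struct CosmosEnv {
    pub endpoint: String,
    pub key: String,
    pub database: String,
}

/// Resolve settings from a lookup function such as
/// `|name| std::env::var(name).ok()`.
pub fn load_env<F>(get: F) -> Result<CosmosEnv, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Assumed required: there is no documented default for these two.
    let required = |name: &str| {
        get(name).ok_or_else(|| format!("missing environment variable {}", name))
    };

    Ok(CosmosEnv {
        endpoint: required("COSMOS_ENDPOINT")?,
        key: required("COSMOS_KEY")?,
        // Documented default: `duroxide`.
        database: get("COSMOS_DATABASE").unwrap_or_else(|| "duroxide".to_string()),
    })
}
```

In a real process this would be called as `load_env(|name| std::env::var(name).ok())` once the `.env` file has been loaded.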
Programmatic config via `CosmosDBProviderConfig`:
- `orch_concurrency` / `worker_concurrency` — number of dispatchers (controls slot partitioning)
- `reconciler_interval` / `reconciler_age_threshold` — outbox reconciler timing (default: 2s each)
## Testing
### Test Infrastructure
- Tests use **per-test containers** (unique UUID suffix) for isolation
- Test concurrency limited to 4 threads (`.config/nextest.toml`) to avoid metadata 429s
- `ensure_infrastructure` has retry with exponential backoff for container creation
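The retry behavior can be sketched as below. This is not the implementation of `ensure_infrastructure`: the function names, the base and cap delays, and the injected `sleep` callback are all assumptions made so the example stays synchronous and testable.

```rust
use std::time::Duration;

/// Delay for a zero-based attempt: `base * 2^attempt`, capped at `cap`.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// Retry `op` while it returns 429, sleeping with exponential backoff.
/// `op` returns an HTTP status; any 2xx is success, and any status
/// other than 429 fails immediately.
pub fn retry_on_429<F, S>(
    mut op: F,
    max_attempts: u32,
    base: Duration,
    cap: Duration,
    mut sleep: S,
) -> Result<(), u16>
where
    F: FnMut() -> u16,
    S: FnMut(Duration),
{
    let mut last = 429;
    for attempt in 0..max_attempts {
        last = op();
        match last {
            200..=299 => return Ok(()),
            // Rate limited with attempts left: back off and try again.
            429 if attempt + 1 < max_attempts => sleep(backoff_delay(attempt, base, cap)),
            _ => return Err(last),
        }
    }
    Err(last)
}
```

In production the `sleep` callback would be an async timer, and adding jitter would help when several test processes create containers at the same moment.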
### Provider Validation Tests
`tests/cosmosdb_provider_test.rs` implements `ProviderFactory` and runs ~196 duroxide validation tests. Each test gets its own container.
### E2E Sample Tests
`tests/e2e_samples.rs` ports the full duroxide e2e sample suite. Uses `tests/common/mod.rs` for shared helpers (`create_cosmos_store`, `wait_for_history`, etc.).
### Local Development
```bash
# Start CosmosDB emulator
docker run -p 8081:8081 -p 10250-10255:10250-10255 \
  mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:latest
# Copy env and run tests
cp .env.example .env
cargo nextest run --features provider-test
```
## Build Commands
```bash
cargo nt # Full test pass (nextest + all features)
cargo nt --test e2e_samples # E2E samples only
cargo c # Clippy (all targets, all features)
cargo build --all-targets # Build everything
cargo check --features provider-test --tests # Quick type check
```
## Key Directories
- `src/` — Provider implementation
- `tests/` — Integration tests (provider validation, e2e samples)
- `tests/common/` — Shared test helpers
- `docs/` — Architecture documentation
## When Changing Provider Code
1. Run full test pass: `cargo nt`
2. Run e2e samples: `cargo nt --test e2e_samples`
3. Check for cross-partition query issues (no `ORDER BY`/aggregates in cross-partition queries)
4. Ensure session piggyback updates use fresh `now_ms()`
5. Verify transactional batch stays under 100 operations
6. Run clippy: `cargo c`
## Workflow Rules
- **Never commit or push without explicit user permission.**
- **Always use nextest** for running tests (`cargo nt`).
- **Test against both emulator and Azure CosmosDB** — the emulator is more lenient with cross-partition queries.
## Updating duroxide Dependency
When a new version of `duroxide` is published to crates.io:
1. **Review changes**: Read the duroxide CHANGELOG, README, and `docs/provider-implementation-guide.md` at the duroxide repo (drox root: `duroxide/`)
2. **Update Cargo.toml**: Bump `duroxide` version, run `cargo check`, fix any compilation errors
3. **Implement API changes**: Update `src/provider.rs` for any `Provider`/`ProviderAdmin` trait changes
4. **Check cross-partition query impact**: If new provider methods involve multi-instance queries, ensure no `ORDER BY` or server-side aggregates in cross-partition queries
5. **Add validation tests**: New provider validation tests from duroxide will automatically run via the `ProviderFactory` pattern in `tests/cosmosdb_provider_test.rs`
6. **Test thoroughly**: `cargo nt` (full test pass), `cargo nt --test e2e_samples` (e2e samples)
7. **Update docs**: CHANGELOG.md, README.md, bump version in Cargo.toml
8. **Publish**: `cargo publish` (only with explicit user permission)
> ⚠️ **Never push to remote or publish to crates.io without explicit user confirmation**
> ⚠️ **Keep consistent with PG providers** — all providers must implement the same `Provider`/`ProviderAdmin` traits. Check `providers/duroxide-pg/` and `providers/duroxide-pg-opt/` for reference on how they handle the same changes.
## Self-Updating These Instructions
When you discover important implementation details, design decisions, or gotchas that would help future sessions, **update this file** (`.github/copilot-instructions.md`) and/or `docs/ARCHITECTURE.md`. Examples:
- New cross-partition query limitations encountered
- Semantic differences between CosmosDB and Postgres that affect correctness
- New critical invariants or race conditions identified
- Changes to build/test commands or aliases
**Always notify the user** when you modify these instructions — never silently update them.