# duroxide-cdb Architecture
> Comprehensive design and implementation reference for the CosmosDB NoSQL provider for [duroxide](https://github.com/microsoft/duroxide).
---
## Table of Contents
1. [Overview](#1-overview)
2. [Data Model](#2-data-model)
3. [Module Map](#3-module-map)
4. [REST Client & Authentication](#4-rest-client--authentication)
5. [Dispatch Slot Partitioning](#5-dispatch-slot-partitioning)
6. [Core Algorithms](#6-core-algorithms)
7. [Transactional Outbox](#7-transactional-outbox)
8. [Session Affinity](#8-session-affinity)
9. [Error Handling](#9-error-handling)
10. [CosmosDB vs Postgres Semantic Differences](#10-cosmosdb-vs-postgres-semantic-differences)
11. [Cross-Partition Query Strategy](#11-cross-partition-query-strategy)
12. [Infrastructure Bootstrapping](#12-infrastructure-bootstrapping)
13. [Configuration](#13-configuration)
14. [Testing](#14-testing)
---
## 1. Overview
`duroxide-cdb` implements the `Provider` and `ProviderAdmin` traits from the duroxide runtime, backed by Azure CosmosDB NoSQL API. All orchestration state — instance metadata, event history, work queues, sessions, and outbox intents — lives in a **single container** with `/instanceId` as the partition key.
### Design Principles
| **Single container, partition-per-instance** | All documents for an orchestration instance share the same logical partition. Intra-instance operations use CosmosDB transactional batch for atomicity. |
| **Transactional outbox for cross-partition writes** | Sub-orchestration starts, completions, and detached orchestration starts record intents inside the source partition's transaction, then deliver best-effort + background reconciler. |
| **Optimistic concurrency via ETags** | Instance locking and queue item locking use CosmosDB `If-Match` conditional writes instead of database-level locks. |
| **Lease-based dispatcher partitioning** | Concurrent dispatchers within a runtime partition the 256-slot keyspace so they never contend on the same queue items. |
| **Raw REST API** | Direct HTTP calls to CosmosDB (`reqwest`), avoiding the Azure SDK. Auth uses HMAC-SHA256 master key tokens. |
### Why Not the Azure SDK?
The provider uses raw REST instead of `azure_data_cosmos` for:
- **Full control** over headers (`x-ms-documentdb-partitionkey`, `x-ms-documentdb-isquery`, etc.)
- **Transactional batch** support via the undocumented-but-stable batch endpoint
- **Minimal dependency footprint** — only `reqwest`, `hmac`, `sha2`, `base64`
- **Cross-partition query control** — explicit header management for `enablecrosspartition`
---
## 2. Data Model
All document types coexist in one container, discriminated by a `type` field. The partition key is `/instanceId`.
### 2.1 Document Types
```
┌─────────────────────────────────────────────────────────────────┐
│ CosmosDB Container │
│ Partition Key: /instanceId │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Partition: "order-123" │ │
│ │ │ │
│ │ [instance] order-123:instance │ │
│ │ [history] order-123:history:1:1 │ │
│ │ [history] order-123:history:1:2 │ │
│ │ [orch_queue] <uuid> (type=orch_queue) │ │
│ │ [worker_queue] <uuid> (type=worker_queue) │ │
│ │ [outbox] intent:<key> │ │
│ │ [session] order-123:session:sess-A │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Partition: "user-456" │ │
│ │ [instance] user-456:instance │ │
│ │ [history] user-456:history:1:1 │ │
│ │ ... │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
| **Instance** | `instance` | `<instanceId>:instance` | Orchestration metadata, lock state, custom status |
| **History** | `history` | `<instanceId>:history:<executionId>:<eventId>` | Append-only event log |
| **Orch Queue** | `orch_queue` | `<uuid>` | Messages for the orchestration dispatcher |
| **Worker Queue** | `worker_queue` | `<uuid>` | Activity execution requests for the worker dispatcher |
| **Outbox Intent** | `outbox_intent` | `intent:<idempotencyKey>` | Pending cross-partition writes |
| **Session** | `session` | `<instanceId>:session:<sessionId>` | Session affinity ownership tracking |
### 2.2 Key Relationships
```
StartOrchestration ──▶ orch_queue item ──▶ fetch_orchestration_item
│
┌────────────────────┘
▼
lock instance (ETag)
collect all orch_queue messages
fetch history
│
▼
ack_orchestration_item
┌─────────────────────────────┐
│ Transactional Batch: │
│ - DELETE orch_queue items │
│ - CREATE history events │
│ - CREATE worker_queue items │
│ - CREATE orch_queue items │
│ - CREATE outbox intents │
│ - UPSERT instance (unlock) │
└─────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
worker_queue orch_queue outbox_intent
(activities) (same part.) (cross-partition)
│ │
▼ ▼
fetch_work_item deliver to target
ack_work_item partition + delete
```
### 2.3 Indexing Policy
The container uses a **selective indexing** strategy — only fields used in queries are indexed; large blob fields are excluded:
**Included paths:** `/type`, `/instanceId`, `/dispatchSlot`, `/visibleAt`, `/enqueuedAt`, `/lockedUntil`, `/executionId`, `/eventId`, `/status`, `/sessionId`, `/ownerId`, `/parentInstanceId`, `/createdAt`, `/customStatusVersion`, `/docType`, `/lockToken`
**Excluded paths:** `/eventData/*`, `/workItem/*`, `/payload/*`, `/output/*`, `/customStatus/*`, `/*` (catch-all)
**Composite indexes:**
1. `(type ASC, dispatchSlot ASC, visibleAt ASC, enqueuedAt ASC)` — serves fetch_orchestration_item / fetch_work_item
2. `(type ASC, executionId ASC, eventId ASC)` — serves history reads
3. `(type ASC, status ASC)` — serves list_instances_by_status
---
## 3. Module Map
```
src/
├── lib.rs Public API: re-exports CosmosDBProvider, CosmosDBProviderConfig
├── provider.rs Provider + ProviderAdmin trait implementations (~2000 lines)
├── client.rs CosmosDBClient: HTTP transport, auth, CRUD + query + batch
├── models.rs Serde structs for all 6 document types + helper constructors
├── query.rs Query builders: cross-partition candidate finders, history fetch
├── batch.rs Transactional batch builder: BatchOperation enum, execute_batch()
├── containers.rs Database/container creation with retry + indexing policy
├── outbox.rs Outbox intent delivery + background reconciler task
├── leases.rs LeaseProvider trait + InMemoryLeaseProvider (256-slot distribution)
└── errors.rs CosmosDB HTTP status → ProviderError mapping
```
### Module Dependencies
```
provider.rs
├── client.rs (HTTP operations)
├── query.rs (query builders)
├── batch.rs (transactional batch)
├── models.rs (document types)
├── outbox.rs (cross-partition delivery)
├── leases.rs (slot assignment)
├── containers.rs (infrastructure setup)
└── errors.rs (error mapping)
```
---
## 4. REST Client & Authentication
### 4.1 Authentication
Every request to CosmosDB requires an `Authorization` header computed from the master key:
```
HMAC-SHA256(key, verb + "\n" + resourceType + "\n" + resourceLink + "\n" + date + "\n\n")
```
The `CosmosDBClient` computes this for each request in `auth_header()`, using:
- `hmac::Hmac<sha2::Sha256>` for signing
- `base64` for encoding
- `urlencoding` for the authorization token
### 4.2 Common Headers
Every request includes:
- `x-ms-date` — RFC 1123 formatted UTC timestamp
- `x-ms-version` — `2020-07-15` (CosmosDB API version)
- `Authorization` — HMAC token
- `x-ms-documentdb-partitionkey` — JSON array, e.g. `["order-123"]`
### 4.3 Operations
| Create document | POST | `/dbs/{db}/colls/{coll}/docs` |
| Read document | GET | `/dbs/{db}/colls/{coll}/docs/{id}` |
| Replace document | PUT | `/dbs/{db}/colls/{coll}/docs/{id}` |
| Delete document | DELETE | `/dbs/{db}/colls/{coll}/docs/{id}` |
| Query | POST | `/dbs/{db}/colls/{coll}/docs` (with `x-ms-documentdb-isquery: true`) |
| Transactional batch | POST | `/dbs/{db}/colls/{coll}/docs` (with `x-ms-cosmos-batch-request: true`) |
| Create database | POST | `/dbs` |
| Create container | POST | `/dbs/{db}/colls` |
---
## 5. Dispatch Slot Partitioning
### 5.1 Problem
Multiple dispatcher tasks within a runtime call `fetch_orchestration_item` / `fetch_work_item` concurrently. Without coordination, they all query the same queue items and race on locks, wasting RUs.
### 5.2 Solution: 256-Slot Keyspace
Every queue item gets a precomputed `dispatchSlot` (0-255) derived from `hash(instanceId) % 256`. Dispatchers are assigned non-overlapping subsets of slots via the `LeaseProvider`:
```
With orch_concurrency = 3:
Dispatcher 0 → slots 0, 3, 6, 9, ..., 255 (86 slots)
Dispatcher 1 → slots 1, 4, 7, 10, ..., 253 (85 slots)
Dispatcher 2 → slots 2, 5, 8, 11, ..., 254 (85 slots)
```
Each dispatcher only queries for items in its assigned slots (`AND c.dispatchSlot IN (...)`), completely eliminating intra-runtime contention.
### 5.3 InMemoryLeaseProvider
The current implementation uses an in-memory lease provider with an `AtomicU32` counter. Each `acquire_slots()` call assigns the next slice of the 256-slot keyspace. Assignments are cached per `caller_id` (tokio task ID).
A future Phase 2 will implement `CosmosDBLeaseProvider` using lease documents for multi-runtime coordination.
---
## 6. Core Algorithms
### 6.1 fetch_orchestration_item
The most complex read path. Finds a candidate queue item, locks the owning instance, collects all pending messages, and returns the orchestration context.
```
1. Acquire dispatch slots from lease provider
2. Cross-partition query: find visible, unlocked orch_queue items in my slots
3. Sort results client-side by enqueuedAt (cross-partition ORDER BY unsupported)
4. Loop (up to 20 attempts, excluding locked instances):
a. Try to lock the instance (conditional replace with ETag)
b. Check capability filter against instance's pinned version
c. Collect all pending orch_queue messages for the instance (single-partition)
d. Tag messages with lock token (transactional batch)
e. Fetch history (single-partition, ordered by eventId)
f. Build OrchestrationItem and return
```
### 6.2 ack_orchestration_item
The most complex write path. Atomically commits a completed orchestration turn using a transactional batch within the instance's partition, plus outbox intents for cross-partition effects.
```
1. Validate lock token matches instance
2. Classify output items by partition:
- same-partition worker items → batch CREATE
- same-partition orch items → batch CREATE
- cross-partition items → outbox intents
3. Build transactional batch (single partition):
- DELETE locked orch_queue messages
- CREATE history events
- CREATE same-partition worker/orch queue items
- CREATE outbox intent documents
- UPSERT instance document (update metadata, release lock)
4. If batch exceeds 100 operations → split into sequential batches
5. Best-effort delete of cancelled activity worker_queue items (outside transaction)
6. Deliver outbox intents best-effort (outside transaction)
7. Background reconciler handles any failed deliveries
```
> **Cancelled activity deletes are NOT in the transactional batch.** When an
> orchestration cancels an in-flight activity (e.g., `select2` picks a timer/event
> over an activity), the worker dispatcher may have already fetched and deleted the
> activity's `worker_queue` document. A batch DELETE of a non-existent document
> returns 404, which causes the entire transactional batch to fail with 424
> (Failed Dependency). To avoid this race, cancelled activity deletes are performed
> best-effort after the batch commits. If the document is already gone, the 404 is
> silently ignored. This is safe because activity cancellation is itself
> best-effort — a completed activity result that arrives after cancellation is
> simply discarded by the runtime's replay logic.
### 6.3 fetch_work_item & ack_work_item
Simpler than the orchestrator path:
- **fetch_work_item**: Query → lock single queue item (conditional replace) → session claim if applicable → return work item
- **ack_work_item**: Delete worker_queue item + create completion orch_queue item (transactional batch within same partition) + piggyback session `last_activity` update
### 6.4 Optimistic Locking Pattern
All "lock" operations follow the same pattern:
```
1. Read document → get current _etag
2. Modify document (set lockToken, lockedUntil, etc.)
3. Replace with If-Match: <etag> header
4. On 412 Precondition Failed → another writer won; skip or retry
5. On 409 Conflict → document already exists; handle accordingly
```
No database-level locks are ever held. This means operations are non-blocking and scale naturally, at the cost of occasional retry loops.
---
## 7. Transactional Outbox
### 7.1 When It Is Used
Cross-partition writes cannot be atomic in CosmosDB. The outbox pattern ensures eventual delivery:
| Scenario | Source Partition | Target Partition |
|----------|-----------------|-----------------|
| Sub-orchestration start | Parent instance | Child instance |
| Sub-orchestration completion | Child instance | Parent instance |
| Detached orchestration start | Coordinator | New instance |
| Cancel cascade | Parent instance | Child instance |
### 7.2 Lifecycle
```
ack_orchestration_item
│
├── Transactional batch (within source partition):
│ CREATE outbox_intent { status: "pending", payload: <target doc JSON> }
│
├── Best-effort delivery (immediately after batch):
│ POST target doc to target partition
│ On success/409 → DELETE intent
│ On transient error → leave as "pending"
│
└── Background reconciler (every 2s):
Query pending intents older than 2s
Retry delivery for each
On success/409 → DELETE intent
On transient error → increment attemptCount, retry next cycle
```
### 7.3 Idempotency
Each outbox intent has a deterministic `idempotencyKey` derived from `<sourceInstance>:<executionId>:<eventSequence>`. The target document's `id` is set to `outbox:<idempotencyKey>`. Duplicate deliveries produce a 409 Conflict (treated as success).
### 7.4 Reconciler
Spawned as a background tokio task during provider initialization. Cancelled via `CancellationToken` on provider drop. Pending intents survive provider restarts — they remain in CosmosDB and are picked up by the next reconciler.
---
## 8. Session Affinity
Session affinity routes work items with the same `sessionId` to the same worker, enabling in-memory state reuse across activity invocations.
### 8.1 Session Documents
Stored in the same partition as the orchestration instance:
- `id`: `<instanceId>:session:<sessionId>`
- Tracks `ownerId`, `lockedUntil`, `lastActivity`
### 8.2 Session Lifecycle
1. **Claim**: During `fetch_work_item`, if the work item has a `sessionId`, read or create the session document. Only the owning worker (or a worker claiming an expired session) can process the item.
2. **Heartbeat**: `ack_work_item` and `renew_work_item_lock` piggyback-update `lastActivity` on the session document using a fresh `now_ms()` timestamp (not the stale one from the start of the operation — important for network-latency-sensitive idle window checks).
3. **Renewal**: `renew_session_lock` queries all session documents, filters by owner + non-expired + recent activity, and extends `lockedUntil`.
4. **Cleanup**: `cleanup_orphaned_sessions` deletes expired sessions with no pending work items.
---
## 9. Error Handling
### 9.1 CosmosDB Status → ProviderError Mapping
| Status | Meaning | ProviderError | Action |
|--------|---------|---------------|--------|
| 200-204 | Success | — | Continue |
| 400 | Bad Request | `permanent` | Query malformed |
| 404 | Not Found | `permanent` | Document absent |
| 408 | Timeout | `retryable` | Transient |
| 409 | Conflict | `retryable` | ETag race or duplicate |
| 412 | Precondition Failed | `retryable` | ETag mismatch |
| 413 | Entity Too Large | `permanent` | Batch too large |
| 429 | Rate Limited | `retryable` | Backoff + retry |
| 503 | Unavailable | `retryable` | Transient |
### 9.2 Infrastructure Retry
`ensure_infrastructure` (database + container creation) uses exponential backoff for 429 metadata rate-limiting: up to 10 retries starting at 500ms (500ms → 1s → 2s → 4s → ...).
---
## 10. CosmosDB vs Postgres Semantic Differences
Several SQL operations that are inherently safe in Postgres require different handling in CosmosDB due to fundamental semantic differences in how the two databases treat missing data and conflicts.
### 10.1 Summary Table
| Operation | Postgres Behavior | CosmosDB Behavior | CDB Provider Strategy |
|-----------|------------------|-------------------|----------------------|
| DELETE non-existent row/doc | No-op (0 rows affected), transaction continues | 404 error, fails entire transactional batch (424) | Best-effort delete outside batch, ignore 404 |
| INSERT duplicate primary key | Unique constraint violation, rolls back transaction | CREATE returns 409 (Conflict), fails entire batch (424) | Keep CREATE — atomicity test expects failure on duplicate event_ids |
| INSERT ON CONFLICT DO NOTHING | Silently skips duplicate, transaction continues | No equivalent in batch — CREATE or Upsert only | Use Upsert where idempotency is needed (e.g., instance doc), Create where uniqueness must be enforced (e.g., history) |
| UPDATE with WHERE matching 0 rows | No-op (0 rows affected), transaction continues | Conditional replace with non-matching ETag returns 412 | Treat 412 as "lost the race" — skip or retry depending on context |
| Transaction scope | Entire stored procedure is atomic (all-or-nothing) | Transactional batch is atomic but limited to single partition, max 100 ops | Split cross-partition writes into outbox intents; split >100 ops into sequential batches |
| Cross-partition query with ORDER BY | Works (query planner handles it) | 400 error (REST gateway returns query plan instead of results) | Fetch all matching items, sort client-side |
| Server-side aggregates (COUNT, SUM) | Works across all partitions | 400 error on cross-partition queries | SELECT ids + client-side `.len()` instead of COUNT |
### 10.2 Cancelled Activity Deletes
The most impactful difference. In `ack_orchestration_item`:
- **Postgres:** `DELETE FROM worker_queue WHERE instance_id = ... AND activity_id = ...` — if the worker already consumed the item, the DELETE affects 0 rows silently. The transaction proceeds.
- **CosmosDB:** A batch `DELETE` of a document ID that doesn't exist returns 404, which fails the entire transactional batch with 424 (Failed Dependency). Every other operation in the batch is rolled back.
**Solution:** Cancelled activity deletes are performed best-effort *after* the batch commits. If the document is already gone, the 404 is silently ignored.
### 10.3 History Event Uniqueness
- **Postgres (latest migration):** Plain `INSERT INTO history` — no `ON CONFLICT` clause. A duplicate `(instance_id, execution_id, event_id)` raises a unique constraint violation, rolling back the entire stored procedure.
- **CosmosDB:** Batch `CREATE` with a deterministic document ID. A duplicate ID returns 409, failing the batch with 424.
Both behaviors are equivalent: duplicate event_ids indicate a double-ack attempt, which must fail to preserve atomicity. The `test_atomicity_failure_rollback` validation test verifies this.
> **Note:** Postgres has a *separate* `append_history` function that uses `INSERT ON CONFLICT DO NOTHING` for idempotent history appends in a different code path. This is not the same as the inline history insert within `ack_orchestration_item`.
### 10.4 Session Piggyback Updates
- **Postgres:** `UPDATE sessions SET last_activity_at = ... WHERE locked_until > now` — if the session lock expired, matches 0 rows silently.
- **CosmosDB:** Conditional replace with ETag. A 412 (Precondition Failed) means the session was modified concurrently. Silently ignored since `last_activity_at` is advisory.
### 10.5 Lock Acquisition
- **Postgres:** `INSERT ON CONFLICT DO UPDATE ... WHERE locked_until <= now` — atomic upsert that silently returns 0 rows if the lock is held.
- **CosmosDB:** Read → conditional replace with `If-Match` ETag. A 412 means another dispatcher won the race. Silently skipped.
---
## 11. Cross-Partition Query Strategy
CosmosDB's gateway mode via REST API has limitations on cross-partition queries:
| Feature | Supported via REST Gateway? |
|---------|---------------------------|
| Simple filter (`WHERE c.type = 'x'`) | Yes |
| `ORDER BY` | **No** — returns 400 with query plan |
| Server-side aggregates (`COUNT`, `MAX`) | **No** — requires query plan decomposition |
| `TOP N` with `ORDER BY` | **No** |
### How We Handle This
- **Queue candidate queries** (`find_candidate_orch_item`, `find_candidate_work_item`): No `ORDER BY` or `TOP`. Fetch all matching items, sort client-side by `enqueuedAt`, take the first.
- **Dispatch slot filter**: Omitted when all 256 slots are assigned (the clause is redundant and bloats the query).
- **Count queries** (`count_by_type`): `SELECT c.id` + client-side `.len()` instead of `SELECT VALUE COUNT(1)`.
- **Partition-scoped queries** (history, messages): Can use `ORDER BY` safely since they target a single partition.
---
## 12. Infrastructure Bootstrapping
On provider initialization, `ensure_infrastructure` creates the database and container if they don't exist:
```
ensure_infrastructure (with retry for 429s)
├── ensure_database
│ POST /dbs { "id": "<database>" }
│ 201 = created, 409 = already exists (both OK)
│
└── ensure_container
POST /dbs/<db>/colls {
"id": "<container>",
"partitionKey": { "paths": ["/instanceId"], "kind": "Hash" },
"indexingPolicy": { ... selective indexing ... }
}
201 = created, 409 = already exists (both OK)
```
The indexing policy is set at creation time. Changing it after creation requires manual intervention.
---
## 13. Configuration
### CosmosDBProviderConfig
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `endpoint` | `String` | — | CosmosDB endpoint URL |
| `key` | `String` | — | CosmosDB master key |
| `database` | `String` | `"duroxide"` | Database name |
| `container` | `String` | `"duroxide"` | Container name |
| `orch_concurrency` | `u32` | `1` | Number of orchestration dispatchers (for slot partitioning) |
| `worker_concurrency` | `u32` | `1` | Number of worker dispatchers (for slot partitioning) |
| `reconciler_interval` | `Duration` | `2s` | Outbox reconciler poll interval |
| `reconciler_age_threshold` | `Duration` | `2s` | Minimum intent age before reconciler picks it up |
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `COSMOS_ENDPOINT` | CosmosDB account endpoint | `http://localhost:8081` (emulator) |
| `COSMOS_KEY` | CosmosDB account key | Emulator default key |
| `COSMOS_DATABASE` | Database name | `duroxide` |
---
## 14. Testing
### 13.1 Provider Validation Tests
The `cosmosdb_provider_test.rs` test file implements the `ProviderFactory` trait and runs the duroxide provider validation suite (~196 tests covering atomicity, concurrency, sessions, cancellation, error handling, capability filtering, and more).
Each test gets an **isolated container** (unique UUID suffix) to ensure test independence. The container is cleaned up after each test.
**Running:** `cargo nextest run --features provider-test`
**Concurrency:** Limited to 4 threads (via `.config/nextest.toml`) to avoid CosmosDB metadata rate-limiting (429) during container creation.
### 13.2 E2E Sample Tests
`e2e_samples.rs` ports the full duroxide e2e sample suite (hello world, fan-out/fan-in, sub-orchestrations, timers, external events, sessions, continue-as-new, cancellation, etc.) to run against CosmosDB.
### 13.3 Local Development
Use the [CosmosDB Linux Emulator](https://learn.microsoft.com/en-us/azure/cosmos-db/emulator) via Docker for local development:
```bash
docker run -p 8081:8081 -p 10250-10255:10250-10255 \
mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:latest
```
Copy `.env.example` to `.env`— the defaults point to the local emulator.
### 13.4 Azure CosmosDB
For cloud testing, create a serverless CosmosDB account and set `COSMOS_ENDPOINT` and `COSMOS_KEY` in `.env`.