## Duroxide – TODOs
### Active TODOs
- Write big rocks docs.
- Publish crate, docs
- RaiseEvent pub/sub
- RaiseEvent should match any wait-for-event subscription, even one that is registered after the event is raised.
- **Nested Select2 Support** - Enable `select2(select2(a, b), c)` composition
- Currently `select2` only accepts `DurableFuture`, not `SelectFuture`
- Options: (1) Add trait for select2 args that both types implement, (2) Add `SelectFuture::into_selectable()` wrapper
- Would enable more expressive race patterns like `select2(select2(fast_path, timeout), fallback)`
- Complexity: Medium - API change, need to handle nested loser tracking
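A minimal sketch of option (1), the trait approach. The real `DurableFuture`/`SelectFuture` types live in duroxide; `Leaf`, the `Option<u32>` readiness model, and the trait name `Selectable` here are illustrative stand-ins, and nested loser tracking is elided:

```rust
// Hypothetical `Selectable` trait implemented by both the leaf future type
// and the select combinator itself, so select2 results can be nested.
trait Selectable {
    fn poll_ready(&self) -> Option<u32>; // winning value, if any branch completed
}

struct Leaf(Option<u32>); // stand-in for DurableFuture

impl Selectable for Leaf {
    fn poll_ready(&self) -> Option<u32> {
        self.0
    }
}

// The combinator also implements Selectable, enabling nested composition.
struct Select2<A: Selectable, B: Selectable> {
    a: A,
    b: B,
}

impl<A: Selectable, B: Selectable> Selectable for Select2<A, B> {
    fn poll_ready(&self) -> Option<u32> {
        self.a.poll_ready().or_else(|| self.b.poll_ready())
    }
}

fn select2<A: Selectable, B: Selectable>(a: A, b: B) -> Select2<A, B> {
    Select2 { a, b }
}

fn main() {
    // The race pattern from above: select2(select2(fast_path, timeout), fallback)
    let nested = select2(select2(Leaf(None), Leaf(Some(7))), Leaf(None));
    assert_eq!(nested.poll_ready(), Some(7));
}
```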
- **History Validation**
- Need to validate histories are well-formed before replay
- Checks: event_id ordering, source_event_id references valid scheduling events, no duplicate event_ids
- Invalid histories should fail fast with clear error, not cause undefined behavior
- Related to select2 loser fix: tests assume valid histories
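The three checks can be sketched as a single forward pass. The `Event` stand-in below keeps only the fields the checks need; the real duroxide event type has more:

```rust
use std::collections::HashSet;

// Illustrative stand-in for a history event.
struct Event {
    event_id: u64,
    source_event_id: Option<u64>, // completions point back at a scheduling event
    is_scheduling: bool,
}

fn validate_history(history: &[Event]) -> Result<(), String> {
    let mut seen = HashSet::new();
    let mut scheduling = HashSet::new();
    let mut prev = None;
    for e in history {
        // No duplicate event_ids.
        if !seen.insert(e.event_id) {
            return Err(format!("duplicate event_id {}", e.event_id));
        }
        // event_ids must be strictly increasing.
        if let Some(p) = prev {
            if e.event_id <= p {
                return Err(format!("event_id {} not after {}", e.event_id, p));
            }
        }
        prev = Some(e.event_id);
        if e.is_scheduling {
            scheduling.insert(e.event_id);
        }
        // source_event_id must reference an earlier scheduling event.
        if let Some(src) = e.source_event_id {
            if !scheduling.contains(&src) {
                return Err(format!("source_event_id {} has no scheduling event", src));
            }
        }
    }
    Ok(())
}

fn main() {
    let ok = [
        Event { event_id: 1, source_event_id: None, is_scheduling: true },
        Event { event_id: 2, source_event_id: Some(1), is_scheduling: false },
    ];
    assert!(validate_history(&ok).is_ok());
}
```

Failing fast with the offending `event_id` in the error keeps invalid histories from reaching replay.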
- **Orchestration-level Metrics** - Add metric recording from orchestration code
- Similar to existing `ctx.trace_*()`, `ctx.new_guid()`, `ctx.utcnow()` system calls
- Add `ctx.record_counter(name, value, labels)`, `ctx.record_histogram(name, value, labels)` etc.
- Would enable retry metrics, custom business metrics from orchestration code
- Must be deterministic (recorded as SystemCall events for replay)
- Complexity: Medium - extend SystemCall machinery, wire to MetricsProvider
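A sketch of how `record_counter` could ride the SystemCall pattern, following `ctx.new_guid()`/`ctx.utcnow()`: on a live turn the call is appended to history and forwarded to the metrics sink; on replay it is consumed from history so nothing is double-counted. All names below are illustrative, not the real duroxide API:

```rust
#[derive(Clone, PartialEq, Debug)]
enum SystemCall {
    RecordCounter { name: String, value: u64 },
}

struct Ctx {
    history: Vec<SystemCall>,
    cursor: usize,               // how far replay has consumed history
    emitted: Vec<(String, u64)>, // stand-in for the MetricsProvider sink
}

impl Ctx {
    fn record_counter(&mut self, name: &str, value: u64) {
        let call = SystemCall::RecordCounter { name: name.to_string(), value };
        if self.cursor < self.history.len() {
            // Replaying: the event already exists; verify and advance.
            assert_eq!(self.history[self.cursor], call, "nondeterministic replay");
            self.cursor += 1;
        } else {
            // Live execution: persist the event and emit the metric once.
            self.history.push(call);
            self.cursor += 1;
            self.emitted.push((name.to_string(), value));
        }
    }
}

fn main() {
    // First (live) turn emits the metric and records a SystemCall event.
    let mut live = Ctx { history: vec![], cursor: 0, emitted: vec![] };
    live.record_counter("retries", 1);
    assert_eq!(live.emitted.len(), 1);

    // Replay over the persisted history emits nothing.
    let mut replay = Ctx { history: live.history.clone(), cursor: 0, emitted: vec![] };
    replay.record_counter("retries", 1);
    assert_eq!(replay.emitted.len(), 0);
}
```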
- Full e2e Otel test
- Wire up sub-orchestration metrics; requires linking names between sub-orchestration and parent in the replay engine
- Review DurableFuture and into_*() fns for soundness
- move stress tests to cargo standard "bench"
- separate execution loops for orchestrations and activities, communicate through channels
- Port samples from DurableTasks and Temporal to tests/scenarios/
- Limits everywhere, orch/activity names, input/output event sizes, history sizes etc.
- LLM-orchestration/provider
- Revive batching from the dispatcher-batching branch; perf dropped drastically there, but that might've been a SQLite-only issue.
- Mgmt API feedback
- list instances should get back version, status, and other basic info.
- truncate API for instances
- truncate API for executions, to be called in eternal orchestrations. Or maybe default it.
- API endpoint for runtime, MCP server, orch status etc.
- Metrics/logging dashboard
- Refactor tests, figure out what are the unit tests vs not
- some failure stress tests.
- [ ] **Infrastructure error handling from provider operations**
- Currently: Provider errors (append/enqueue failures) cause retries via lock expiration
- Future: Implement visibility timeout to push failed instances to back of queue
- Related: May need abandon with delay for persistent infrastructure failures
- Priority: Medium (current retry mechanism works but could be more sophisticated)
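The visibility-timeout idea can be sketched with a tiny queue model: instead of a failed item becoming lockable again as soon as its lock expires, it is abandoned with a delay, which pushes it behind other ready work. The queue shape and logical clock below are simplified stand-ins, not the provider's real work-item queue:

```rust
// Illustrative queue item hidden until its visibility timestamp.
struct Item {
    instance: &'static str,
    visible_at: u64, // logical timestamp; item is hidden until then
}

struct Queue {
    items: Vec<Item>,
}

impl Queue {
    // Fetch the index of the first item that is currently visible.
    fn dequeue(&self, now: u64) -> Option<usize> {
        self.items.iter().position(|i| i.visible_at <= now)
    }

    // Abandon with delay: hide the item until now + delay.
    fn abandon(&mut self, idx: usize, now: u64, delay: u64) {
        self.items[idx].visible_at = now + delay;
    }
}

fn main() {
    let mut q = Queue {
        items: vec![
            Item { instance: "failing", visible_at: 0 },
            Item { instance: "healthy", visible_at: 0 },
        ],
    };
    // "failing" is picked first, hits an infrastructure error, and is
    // abandoned with a delay; the next dequeue sees "healthy" instead.
    let idx = q.dequeue(0).unwrap();
    assert_eq!(q.items[idx].instance, "failing");
    q.abandon(idx, 0, 30);
    assert_eq!(q.items[q.dequeue(1).unwrap()].instance, "healthy");
}
```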
- Drop crates/dependencies that aren't needed
- Add orchestrator functions
- Macros for syntactic sugar [DESIGNED - see docs/proposals/MACRO-FINAL-DESIGN.md]
- activity session management - cancellations, progress report, messaging.
- reduce the junk in stress tests, make them useful
- name the dispatchers
- Update provider docs
- provider layer should have retries built into it and not sprinkled across all the code
- need to build an azure storage provider
- should focus on simplification, refactoring, and then threading for dispatchers before syntactic sugar
- pick proposal from branch macro3 - macro and macro2 branches have fuller impls but not clean
- macros for activities, orchestrations
- orchestration functions!
- build node and python bindings as well
- convert execution, instance statuses to enums
- Introduce a provider error type with Retryable/NonRetryable classification; update runtime to use it for retries across all provider ops (not just ack_orchestration_item)
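A hypothetical shape for the classified error and a shared retry wrapper; the variant names and `with_retries` signature are illustrative, not the planned duroxide API:

```rust
// Classified provider error: the runtime retries only what is safe to retry.
#[derive(Debug)]
enum ProviderError {
    Retryable(String),    // transient: I/O hiccup, lock contention, timeout
    NonRetryable(String), // permanent: corrupt row, constraint violation
}

impl ProviderError {
    fn is_retryable(&self) -> bool {
        matches!(self, ProviderError::Retryable(_))
    }
}

// One retry loop shared by all provider ops, instead of ad-hoc retries at call sites.
fn with_retries<T>(
    mut op: impl FnMut() -> Result<T, ProviderError>,
    max_attempts: u32,
) -> Result<T, ProviderError> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if e.is_retryable() && attempt + 1 < max_attempts => attempt += 1,
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Succeeds on the third attempt after two retryable failures.
    let mut calls = 0;
    let result = with_retries(
        || {
            calls += 1;
            if calls < 3 {
                Err(ProviderError::Retryable("busy".into()))
            } else {
                Ok(calls)
            }
        },
        5,
    );
    assert_eq!(result.unwrap(), 3);

    // A NonRetryable error fails immediately, with no further attempts.
    let r: Result<(), _> = with_retries(|| Err(ProviderError::NonRetryable("corrupt".into())), 5);
    assert!(!r.unwrap_err().is_retryable());
}
```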
- Proper lock / visibility timeouts
- review duplicate orch instance ids
- perf pass over replay engine
- parallelize dispatcher loops
- lock TTL for timer and worker queues and update lease
- Reduce ornamental user code in orchestrations and activities
- Continue the provider simplification
- Rename to provider
- fault inject: "TODO : fault injection :"
- code coverage
- CLI tooling to manipulate history/queue state for devops
- add cancellation and status from within the orchestration
- example versioned crates with orchestrations and loaders
- profiling replay and providers
- performance improvements for runtime
- "pub-sub" external events
- sharding and scale out to multiple runtimes
- strongly typed activity and orchestration calls?
- cancellations via JoinHandles?
- Add proper metrics.
- Build a website to visualize execution.
- Write an Azure Blob based provider.
- Batch the calls to logging, don't spin up an activity per
- Orchestration state? Monitoring? Visibility?
- Real world samples (provisioning resources in Azure e.g.)
## DONE
- Github actions/pipeline
- Docs review for duroxide, duroxide-pg.
- **[BLOCKER] Event Schema Redesign: Common Fields + Timestamps**
- Current Issue: `duroxide_orchestration_duration_seconds` only measures final turn duration (40ms), not total orchestration lifetime (could be hours/days)
- Proposed Solution: Restructure Event from flat enum to struct + EventKind enum (see `proposals/event-schema-redesign.md`)
- Common fields: `event_id`, `instance_id`, `execution_id`, `timestamp_ms`, `duroxide_version` (crate semver)
- Event-specific fields: `kind: EventKind` enum
- Impact: Large refactor (~40 event creation sites, ~140+ match sites), but no DB schema changes (JSON is flexible)
- Benefits: Accurate duration metrics, self-contained events, cleaner API, easy to add common fields, forward compatible
- Complexity: High - Touches many files, but serde provides backward compat via `#[serde(default)]`
- Related: WorkItem should get same treatment (also persisted as JSON in queues)
- Design Doc: `proposals/event-schema-redesign.md`, `docs/event-schema-evolution.md`
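A sketch of the proposed shape: common fields on a struct, event-specific data in an `EventKind` enum. Serde attributes (e.g. `#[serde(default)]` for backward compat) are omitted to keep the sketch dependency-free; variants beyond the two shown are elided, and `lifetime_ms` is an illustrative helper showing why a per-event timestamp fixes the duration metric:

```rust
// Common fields live on the struct, per proposals/event-schema-redesign.md.
struct Event {
    event_id: u64,
    instance_id: String,
    execution_id: u64,
    timestamp_ms: u64,
    duroxide_version: String,
    kind: EventKind,
}

#[allow(dead_code)]
enum EventKind {
    OrchestrationStarted,
    OrchestrationCompleted,
    // ...other variants carry their event-specific fields
}

// With a timestamp on every event, total orchestration lifetime is simply
// last - first, instead of only measuring the final turn.
fn lifetime_ms(history: &[Event]) -> Option<u64> {
    let first = history.first()?;
    let last = history.last()?;
    Some(last.timestamp_ms - first.timestamp_ms)
}

fn main() {
    let mk = |id, ts, kind| Event {
        event_id: id,
        instance_id: "i-1".into(),
        execution_id: 1,
        timestamp_ms: ts,
        duroxide_version: "0.1.0".into(),
        kind,
    };
    let history = vec![
        mk(1, 1_000, EventKind::OrchestrationStarted),
        mk(2, 3_600_000, EventKind::OrchestrationCompleted),
    ];
    // An orchestration that ran for about an hour reports about an hour.
    assert_eq!(lifetime_ms(&history), Some(3_599_000));
}
```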
- **[FIXED] Select2 Loser Completions Block FIFO Ordering**
- **Activity Retry Policy** - Implement built-in activity retry logic
- Design doc: `proposals/activity-retry-policy.md`
- Fix up the stress tests, make them more usable by providers
- Do a pass on registries for interface consistency
- Disable infra logs by default
- conventions and changes for exporting and importing orchestration and activity registries across crates
- make the provider tests usable by outside implementors
- Add new provider tests
- prompt files??
- multi threaded timer/worker dispatchers
- separate out dispatchers so they can run independently
- multi threaded orchestrator dispatcher
- Provider cleanup:
  - create_with_execution (ack_orchestration_item already creates new history)
  - append_with_execution (unused)
- sqlite instance table should not have a status, it should only be in the execution table
- updated to sqlx 0.8
- remove unused crate dependencies
- Update all provider and API docs, get ready to push crate
- implementing a provider requires the implementor to look deeply at the sqlite provider code
- clean up leaky abstractions (provider should not have to compute the orchestration state and output e.g)
- drop the polling frequency for the dispatchers
- Cleanup the docs before making public
- write a real world orchestrations with versioning etc
- fix up the tracing to not use activities [DONE - tracing is now host-side only]
- host level events (tracing, guid, time) [IN PROGRESS via RFC]
- SQLite provider with full transactional support and e2e test parity
- Full ACID transactional semantics
- Provider-backed timer queue with delayed visibility
- Handles concurrent instance execution
- All 25 core e2e tests passing
- Fixed trace activities to be fire-and-forget (no longer cause nondeterminism)
- Timer acknowledgment only after firing (reliability fix)
- Worker queue acknowledgment only after completion enqueue (reliability fix)
- Return instance history as well
- Update history + complete locks + enqueue in the same call
- typed parameters for activities and orchestrations
- proper timer implementation in the provider
- versioning strategy
- Support for orchestration chaining + eternal orchestrations
- Need to understand this oneshot channel to await on
- transactional processing - harden queue read semantics (see docs/reliability-queue-and-history.md)
- review how active_instances in the runtime work, why are we reenqueuing events in certain cases
- On next session: review tests covering ContinueAsNew and multi-execution IDs
- Files: `tests/e2e_continue_as_new.rs` (both tests)
  - Also revisit runtime APIs to surface `list_executions` and `get_execution_history` consistently
- ContinueAsNew support
- test hung because the initial orchestration takes longer to start!
- tests for orchestration state
- tests for provider <-> runtime resumption of multiple orchestrations.
- dequeue multiple item batch
- do a pass through the code.
- Add orchestration registry
- Add capability to the runtime to resume persisted orchestrations from the history provider
- Add signalling mechanism in the provider which runtime can poll to trigger replay
- resolve the test hang!!
- Document all methods
- Add proper logging
- Write a file system based provider.
- Crash recovery tests.
- Error handling for bad activity or orchestration names, or accessing wrong instance IDs
- Add a Gabbar test
- Logging typed handlers
- Max size of orchestration history
- Detailed documentation in the docs folder for how the system works
- Remove dead code, including the one with allow(dead_code)
- Write detailed architecture and user documentation
- Formalize a provider model for the state, queues and timers.
- Write GUID and time deterministic helper methods.
## POSTPONED
- clients should be able to query registered activities and orchestrations
- implemented in branch runtime-registry, but it needs much more than just the orch name/version
- implement Unpin typed future wrappers for `_typed` adapters
- redo the orchestration registry change with gpt5 and compare
- mermaid diagrams for orchestrations???
- remove the into_activity() and similar methods