duroxide 0.1.0

Deterministic task orchestration framework for Rust - inspired by Durable Task Framework
## Duroxide – TODOs

### Active TODOs

- Write big rocks docs.
- Publish crate, docs
- RaiseEvent pub/sub
- RaiseEvent should target any Wait for Event, even one that is only issued later.
- **Nested Select2 Support** - Enable `select2(select2(a, b), c)` composition
  - Currently `select2` only accepts `DurableFuture`, not `SelectFuture`
  - Options: (1) Add trait for select2 args that both types implement, (2) Add `SelectFuture::into_selectable()` wrapper
  - Would enable more expressive race patterns like `select2(select2(fast_path, timeout), fallback)`
  - Complexity: Medium - API change, need to handle nested loser tracking
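Option (1) above could look roughly like the sketch below. It is not duroxide's API: `Selectable`, `Leaf`, `Outcome`, and this `Select2` are hypothetical stand-ins that model futures as "already completed at history position N, or still pending", only to show how a shared trait makes nesting type-check and how loser ids can bubble up through nested races.

```rust
// Hypothetical stand-ins for illustration; these are NOT duroxide's real types.

/// Outcome of a race: the winning leaf, when it completed, and every losing leaf.
#[derive(Debug, PartialEq)]
struct Outcome {
    winner: u64,
    completed_at: u64,
    losers: Vec<u64>,
}

/// Assumed trait that both leaf futures and select results would implement.
trait Selectable {
    /// `Some` once this node has a winner, `None` while everything is pending.
    fn resolve(&self) -> Option<Outcome>;
    /// Ids of all leaves under this node, used to report losers.
    fn leaf_ids(&self) -> Vec<u64>;
}

/// Stand-in for a leaf future: an id plus the history position where it completed.
struct Leaf {
    id: u64,
    completed_at: Option<u64>,
}

impl Selectable for Leaf {
    fn resolve(&self) -> Option<Outcome> {
        self.completed_at.map(|at| Outcome {
            winner: self.id,
            completed_at: at,
            losers: Vec::new(),
        })
    }
    fn leaf_ids(&self) -> Vec<u64> {
        vec![self.id]
    }
}

/// Stand-in for the select result; it is itself `Selectable`, so it nests.
struct Select2<A, B> {
    a: A,
    b: B,
}

fn select2<A: Selectable, B: Selectable>(a: A, b: B) -> Select2<A, B> {
    Select2 { a, b }
}

impl<A: Selectable, B: Selectable> Selectable for Select2<A, B> {
    fn resolve(&self) -> Option<Outcome> {
        // Earliest completion wins; ties go to the first argument.
        let (win, lost_side) = match (self.a.resolve(), self.b.resolve()) {
            (Some(x), Some(y)) if y.completed_at < x.completed_at => (y, self.a.leaf_ids()),
            (Some(x), _) => (x, self.b.leaf_ids()),
            (None, Some(y)) => (y, self.a.leaf_ids()),
            (None, None) => return None,
        };
        // Losers from the winning side's inner races plus every leaf on the losing side.
        let mut losers = win.losers;
        losers.extend(lost_side);
        Some(Outcome { winner: win.winner, completed_at: win.completed_at, losers })
    }
    fn leaf_ids(&self) -> Vec<u64> {
        let mut ids = self.a.leaf_ids();
        ids.extend(self.b.leaf_ids());
        ids
    }
}
```

With `fast_path` (id 1, completed at 5), `timeout` (id 2, completed at 3), and a pending `fallback` (id 3), `select2(select2(fast_path, timeout), fallback).resolve()` yields winner 2 with losers `[1, 3]`. The real change would also have to deal with polling and replay, which this model leaves out.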

- **History Validation**
  - Need to validate histories are well-formed before replay
  - Checks: event_id ordering, source_event_id references valid scheduling events, no duplicate event_ids
  - Invalid histories should fail fast with clear error, not cause undefined behavior
  - Related to select2 loser fix: tests assume valid histories
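A minimal sketch of the three checks listed above, under stated assumptions: the `Event` shape here is a simplified, hypothetical one (the real type has more fields and kinds), "ordering" is taken to mean strictly increasing `event_id`, and a `source_event_id` is considered valid only if it points at an earlier scheduling event. `HistoryError` and `validate_history` are illustrative names.

```rust
use std::collections::HashSet;

// Hypothetical, simplified event shape; not duroxide's real Event type.
#[derive(Debug, Clone)]
struct Event {
    event_id: u64,
    /// For completion events: id of the scheduling event they answer.
    source_event_id: Option<u64>,
    /// True when this event schedules work (activity, timer, sub-orchestration).
    is_scheduling: bool,
}

#[derive(Debug, PartialEq)]
enum HistoryError {
    DuplicateEventId(u64),
    OutOfOrder { prev: u64, next: u64 },
    DanglingSource { event_id: u64, source_event_id: u64 },
}

/// Fails fast on the first malformed event instead of letting replay misbehave.
fn validate_history(history: &[Event]) -> Result<(), HistoryError> {
    let mut seen = HashSet::new();
    let mut scheduled = HashSet::new();
    let mut prev: Option<u64> = None;

    for ev in history {
        // No duplicate event_ids.
        if !seen.insert(ev.event_id) {
            return Err(HistoryError::DuplicateEventId(ev.event_id));
        }
        // Assumed ordering rule: ids strictly increase through the history.
        if let Some(p) = prev {
            if ev.event_id <= p {
                return Err(HistoryError::OutOfOrder { prev: p, next: ev.event_id });
            }
        }
        // A completion must reference a scheduling event that appeared earlier.
        if let Some(src) = ev.source_event_id {
            if !scheduled.contains(&src) {
                return Err(HistoryError::DanglingSource {
                    event_id: ev.event_id,
                    source_event_id: src,
                });
            }
        }
        if ev.is_scheduling {
            scheduled.insert(ev.event_id);
        }
        prev = Some(ev.event_id);
    }
    Ok(())
}
```

Running this once before replay would turn each malformed case into a named error the caller can log or surface.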

- **Orchestration-level Metrics** - Add metric recording from orchestration code
  - Similar to existing `ctx.trace_*()`, `ctx.new_guid()`, `ctx.utcnow()` system calls
  - Add `ctx.record_counter(name, value, labels)`, `ctx.record_histogram(name, value, labels)` etc.
  - Would enable retry metrics, custom business metrics from orchestration code
  - Must be deterministic (recorded as SystemCall events for replay)
  - Complexity: Medium - extend SystemCall machinery, wire to MetricsProvider
- Full e2e Otel test
- Wire up suborchestration metrics; requires linking names between suborch and parent in the replay engine
- Review DurableFuture and into_*() fns for soundness
- move stress tests to cargo standard "bench"
- separate execution loops for orchestrations and activities, communicate through channels
- Port samples from DurableTasks and Temporal to tests/scenarios/
- Limits everywhere: orch/activity names, input/output event sizes, history sizes, etc.
- LLM-orchestration/provider
- Revive batching from dispatcher-batching branch, currently the perf dropped drastically. Might've been a sqlite-only issue. 
- Mgmt API feedback
  - list instances should get back version, status, and other basic info.
  - truncate API for instances
  - truncate API for executions, to be called in eternal orchestrations. Or maybe default it. 
- API endpoint for runtime, MCP server, orch status etc. 
- Metrics/logging dashboard 
- Refactor tests, figure out what are the unit tests vs not
- some failure stress tests. 
- [ ] **Infrastructure error handling from provider operations**
  - Currently: Provider errors (append/enqueue failures) cause retries via lock expiration
  - Future: Implement visibility timeout to push failed instances to back of queue
  - Related: May need abandon with delay for persistent infrastructure failures
  - Priority: Medium (current retry mechanism works but could be more sophisticated)
- Drop crates/dependencies that aren't needed
- Add orchestrator functions
- Macros for syntactic sugar [DESIGNED - see docs/proposals/MACRO-FINAL-DESIGN.md]
- activity session management - cancellations, progress report, messaging.
- reduce the junk in stress tests, make them useful
- name the dispatchers
- Update provider docs
- provider layer should have retries built into it and not sprinkled across all the code
- need to build an azure storage provider
- should focus on simplification, refactoring and then threading for dispatchers before syntactic sugar
  - pick proposal from branch macro3 - macro and macro2 branches have fuller impls but not clean
- macros for activities, orchestrations
- orchestration functions!
- build node and python bindings as well
- convert execution, instance statuses to enums
- Introduce a provider error type with Retryable/NonRetryable classification; update runtime to use it for retries across all provider ops (not just ack_orchestration_item)
- Proper lock / visibility timeouts
- review duplicate orch instance ids
- perf pass over replay engine
- parallelize dispatcher loops
- lock TTL for timer and worker queues and update lease
- Reduce ornamental user code in orchestrations and activities
- Continue the provider simplification
- Rename to provider
- fault inject: "TODO : fault injection :"
- code coverage
- CLI tooling to manipulate history/queue state for devops
- add cancellation and status from within the orchestration
- example versioned crates with orchestrations and loaders
- profiling replay and providers
- performance improvements for runtime
- "pub-sub" external events
- sharding and scale out to multiple runtimes
- strongly typed activity and orchestration calls?
- cancellations via JoinHandles?
- Add proper metrics.
- Build a website to visualize execution.
- Write an Azure Blob based provider.
- Batch the calls to logging, don't spin up an activity per
- Orchestration state? Monitoring? Visibility?
- Real world samples (provisioning resources in Azure e.g.)
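For the provider error item above (Retryable/NonRetryable classification, with retries living in one place rather than sprinkled across the code), one possible shape is sketched below. `ProviderError` as written here and the `with_retries` helper are assumptions for illustration, not existing duroxide APIs, and the sketch is synchronous with no backoff to keep it dependency-free.

```rust
use std::fmt;

/// Assumed error type: the provider classifies each failure itself.
#[derive(Debug, Clone, PartialEq)]
enum ProviderError {
    /// Transient failure (e.g. busy database, timeout); safe to try again.
    Retryable(String),
    /// Permanent failure (e.g. corrupt data); retrying cannot help.
    NonRetryable(String),
}

impl ProviderError {
    fn is_retryable(&self) -> bool {
        matches!(self, ProviderError::Retryable(_))
    }
}

impl fmt::Display for ProviderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ProviderError::Retryable(msg) => write!(f, "retryable provider error: {msg}"),
            ProviderError::NonRetryable(msg) => write!(f, "non-retryable provider error: {msg}"),
        }
    }
}

impl std::error::Error for ProviderError {}

/// Hypothetical single retry point for any provider operation.
/// Calls `op` at least once and at most `max_attempts` times, passing the
/// 1-based attempt number; only retryable errors trigger another attempt.
fn with_retries<T>(
    max_attempts: u32,
    mut op: impl FnMut(u32) -> Result<T, ProviderError>,
) -> Result<T, ProviderError> {
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // A real runtime would sleep with backoff here before retrying.
            Err(e) if e.is_retryable() && attempt < max_attempts => attempt += 1,
            Err(e) => return Err(e),
        }
    }
}
```

The runtime would then wrap every provider call (not just `ack_orchestration_item`) in the same helper, and a non-retryable error would surface immediately instead of waiting for lock expiration.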

## DONE

- GitHub Actions/pipeline
- Docs review for duroxide, duroxide-pg.
- **[BLOCKER] Event Schema Redesign: Common Fields + Timestamps**
  - Current Issue: `duroxide_orchestration_duration_seconds` only measures final turn duration (40ms), not total orchestration lifetime (could be hours/days)
  - Proposed Solution: Restructure Event from flat enum to struct + EventKind enum (see `proposals/event-schema-redesign.md`)
  - Common fields: `event_id`, `instance_id`, `execution_id`, `timestamp_ms`, `duroxide_version` (crate semver)
  - Event-specific fields: `kind: EventKind` enum
  - Impact: Large refactor (~40 event creation sites, ~140+ match sites), but no DB schema changes (JSON is flexible)
  - Benefits: Accurate duration metrics, self-contained events, cleaner API, easy to add common fields, forward compatible
  - Complexity: High - Touches many files, but serde provides backward compat via `#[serde(default)]`
  - Related: WorkItem should get same treatment (also persisted as JSON in queues)
  - Design Doc: `proposals/event-schema-redesign.md`, `docs/event-schema-evolution.md`
- **[FIXED] Select2 Loser Completions Block FIFO Ordering**
- **Activity Retry Policy** - Implement built-in activity retry logic
  - Design doc: `proposals/activity-retry-policy.md`
- Fix up the stress tests, make them more usable by providers
- Do a pass on registries for interface consistency
- Disable infra logs by default
- conventions and changes for exporting and importing orchestration and activity registries across crates
- make the provider tests usable by outside implementors
- Add new provider tests
- prompt files??
- multi threaded timer/worker dispatchers
- separate out dispatchers so they can run independently
- multi threaded orchestrator dispatcher
- Provider cleanup:
  - create_with_execution (ack_orchestration_item already creates new history)
  - append_with_execution (unused)
- sqlite instance table should not have a status, it should only be in the execution table
- updated to sqlx 0.8
- remove unused crate dependencies
- Update all provider and API docs, get ready to push crate
- implementing a provider requires the implementor to look deeply at the sqlite provider code
- clean up leaky abstractions (e.g. provider should not have to compute the orchestration state and output)
- drop the polling frequency for the dispatchers
- Cleanup the docs before making public
- write real-world orchestrations with versioning etc
- fix up the tracing to not use activities [DONE - tracing is now host-side only]
- host level events (tracing, guid, time) [IN PROGRESS via RFC]
- SQLite provider with full transactional support and e2e test parity
  - Full ACID transactional semantics
  - Provider-backed timer queue with delayed visibility
  - Handles concurrent instance execution
  - All 25 core e2e tests passing
- Fixed trace activities to be fire-and-forget (no longer cause nondeterminism)
- Timer acknowledgment only after firing (reliability fix)
- Worker queue acknowledgment only after completion enqueue (reliability fix)
- Return instance history as well
- Update history + complete locks + enqueue in the same call
- typed parameters for activities and orchestrations
- proper timer implementation in the provider
- versioning strategy
- Support for orchestration chaining + eternal orchestrations
- Need to understand this oneshot channel to await on
- transactional processing - harden queue read semantics (see docs/reliability-queue-and-history.md)
- review how active_instances in the runtime work, why are we reenqueuing events in certain cases
- On next session: review tests covering ContinueAsNew and multi-execution IDs
  - Files: `tests/e2e_continue_as_new.rs` (both tests)
  - Also revisit runtime APIs to surface `list_executions` and `get_execution_history` consistently
- ContinueAsNew support
- test hung because the initial orchestration takes longer to start!
- tests for orchestration state
- tests for provider <-> runtime resumption of multiple orchestrations
- dequeue multiple item batch
- do a pass through the code. 
- Add orchestration registry
- Add capability to the runtime to resume persisted orchestrations from the history provider
- Add signalling mechanism in the provider which runtime can poll to trigger replay
- resolve the test hang!!
- Document all methods
- Add proper logging
- Write a file system based provider.
- Crash recovery tests.
- Error handling for bad activity or orchestration names, or accessing wrong instance IDs
- Add a Gabbar test
- Logging typed handlers
- Max size of orchestration history
- Detailed documentation in the docs folder for how the system works
- Remove dead code, including code marked with `allow(dead_code)`
- Write detailed architecture and user documentation 
- Formalize a provider model for the state, queues and timers.
- Write GUID and time deterministic helper methods.
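As a rough illustration of the Event Schema Redesign item in this list, the sketch below shows the "struct of common fields plus `EventKind` enum" shape and why a per-event timestamp fixes the duration metric. The common field names come from that item; the `EventKind` variants, `Event::new`, and `orchestration_duration_ms` are hypothetical and far smaller than the real set, and serde attributes are omitted.

```rust
// Hypothetical, trimmed-down variants; the real EventKind has many more.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum EventKind {
    OrchestrationStarted { name: String },
    ActivityScheduled { name: String },
    ActivityCompleted { source_event_id: u64 },
    OrchestrationCompleted { output: String },
}

/// Common fields live on the struct; event-specific data lives in `kind`.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct Event {
    event_id: u64,
    instance_id: String,
    execution_id: u64,
    timestamp_ms: u64,
    duroxide_version: String,
    kind: EventKind,
}

impl Event {
    /// Illustrative constructor with placeholder instance and version values.
    fn new(event_id: u64, timestamp_ms: u64, kind: EventKind) -> Self {
        Event {
            event_id,
            instance_id: "inst-1".to_string(),
            execution_id: 1,
            timestamp_ms,
            duroxide_version: "0.1.0".to_string(),
            kind,
        }
    }
}

/// Total lifetime from the start event to the completion event, rather than
/// the duration of the final turn only. `None` if either event is missing.
fn orchestration_duration_ms(history: &[Event]) -> Option<u64> {
    let start = history
        .iter()
        .find(|e| matches!(e.kind, EventKind::OrchestrationStarted { .. }))?;
    let end = history
        .iter()
        .rev()
        .find(|e| matches!(e.kind, EventKind::OrchestrationCompleted { .. }))?;
    Some(end.timestamp_ms.saturating_sub(start.timestamp_ms))
}
```

Because every event carries `timestamp_ms`, an orchestration that starts at 1,000 ms and completes at 3,601,000 ms reports 3,600,000 ms, regardless of how short its last turn was.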

## POSTPONED

- clients should be able to query registered activities and orchestrations
  - implemented in branch runtime-registry, but it needs much more than just the orch name/version
- implement Unpin typed future wrappers for `_typed` adapters
- redo the orchestration registry change with gpt5 and compare
- mermaid diagrams for orchestrations???
- remove the into_activity() and similar methods