dagx
A minimal, type-safe, runtime-agnostic async DAG (Directed Acyclic Graph) executor with compile-time dependency validation and true parallel execution.
Why dagx? No custom scheduler—just battle-tested primitives (channels, async/await). Compile-time type safety catches wiring errors before runtime. Works with ANY async runtime (Tokio, async-std, smol). Simple API: #[task], add_task(), depends_on()—that's it. 4-148x faster than dagrs across all workloads with sub-microsecond overhead per task.
Features
- Compile-time type safety: Dependencies are validated at compile time through the type system. No
dyn Any, no downcasting, no runtime type errors. - Runtime-agnostic: Works with any async runtime (Tokio, async-std, smol, and more)
- True parallelism: Tasks spawn to multiple threads for genuine parallel execution
- Type-state pattern: The API prevents incorrect dependency wiring through compile-time errors
- Zero-cost abstractions: Leverages generics and monomorphization for minimal overhead
- Flexible task patterns: Supports stateless, read-only, and mutable state tasks
- Simple API: Just
#[task],DagRunner,TaskHandle, andTaskBuilder - Comprehensive error handling: Result-based errors with actionable messages
Design Philosophy: Primitives as Scheduler
dagx takes an unconventional approach to task orchestration: there is no traditional scheduler. Instead of complex scheduling logic managing when tasks run, the entire system is built on communication and synchronization primitives that do the scheduling themselves.
How It Works
Traditional DAG executors contain substantial scheduling code—algorithms that track task states, manage dependencies, coordinate execution order, and handle synchronization. This code is complex, error-prone, and difficult to verify.
dagx eliminates this entirely:
- Wire up tasks with primitives: During DAG construction, tasks are connected using channels and organized into topological layers based on their dependencies
- Start everything simultaneously: When you call
run(), all tasks spawn at once—there's no scheduler deciding when each task should start - Let the primitives handle coordination: Channels naturally enforce execution order—tasks wait on their input channels until upstream tasks send data. No custom orchestration needed.
- Runtime joins on completion: The runtime simply spawns all tasks and waits for the completion channel to close. That's it.
The Implementation
Under the hood, dagx uses:
- Oneshot channels: Each edge in the DAG gets a
futures::channel::oneshot- producer sends once, consumer receives once - Ownership model: Tasks take ownership (
self) and are consumed during execution - no Mutex needed for task state - Direct data flow: Outputs flow from producer to consumer via channels, never stored in shared memory during execution
- Fast-path optimization: Single-task layers execute inline without spawning overhead
The core execution logic in run() is remarkably simple:
// Create oneshot channel for each edge
for in edges
// Execute layer by layer
for layer in layers
// Tasks coordinate themselves via channels - no scheduler needed
No state machines. No task queues. No wake-up logic. No Mutex locks. Just channels doing what channels do, with an inline fast-path for sequential execution.
Inline Execution Fast-Path
Performance optimization for sequential workloads: When a layer contains only a single task (common in deep chains and linear pipelines), dagx executes it inline rather than spawning it. This eliminates spawning overhead, context switching, and channel creation, resulting in 10-100x performance improvements for sequential patterns.
Panic handling guarantee: To maintain behavioral consistency between inline and spawned execution, panics in inline tasks are caught using FutureExt::catch_unwind() and converted to errors. This matches the behavior of all major async runtimes (Tokio, async-std, smol, embassy-rs), which catch panics in spawned tasks and convert them to JoinError or equivalent.
Why this matters:
- Spawned tasks (layer.len() > 1): Runtime catches panics automatically → becomes error
- Inline tasks (layer.len() == 1): We catch panics manually → becomes error
- Result: Identical behavior regardless of execution path
This ensures your code behaves the same whether a task runs inline or spawned, making dagx's optimizations transparent and predictable.
Benefits
Simplicity: The runtime is straightforward: create channels, spawn tasks, let them coordinate via awaiting. No complex scheduler code to maintain, debug, or optimize.
Reliability: Built on battle-tested primitives (oneshot channels, async/await) from Rust's standard library and the futures crate. These have been used in production by thousands of projects and are orders of magnitude more reliable than custom scheduling logic.
Bug resistance: Fewer moving parts means fewer places for bugs to hide. The type system enforces correct wiring at compile time. Channels handle synchronization. The ownership model prevents data races. What's left to break?
Performance: Near zero-overhead. No Mutex locks during execution. Arc reference counting for efficient fanout (atomic operations, not locks). Oneshot channels are often lock-free. Tasks start as soon as their dependencies complete - maximum parallelism.
Auditability: Want to verify correctness? Check the channel wiring, verify tasks await their inputs, done. No need to trace through complex state machine transitions or wake-up cascades.
The Insight
The key insight is that dependencies ARE the schedule. If task B depends on task A's output, a channel naturally enforces that B waits for A. The dependency graph already encodes all the scheduling information—we just need to wire up channels to match it.
This is dagx's core philosophy: leverage the type system for correctness, use primitives for coordination, and let the compiler optimize everything else away.
Measured Overhead
How much overhead does this approach actually add? Benchmarks on an AMD 7840U (Zen 4 laptop CPU) show:
DAG Construction:
- Empty DAG creation: ~20 nanoseconds
- Adding tasks: ~96 nanoseconds per task
- Building a 10,000-task DAG: ~1.04 milliseconds (104 ns/task)
Execution Overhead (framework coordination, excluding actual task work):
- Sequential workloads: ~0.78 microseconds per task (inline execution fast-path)
- Parallel workloads: ~1.2 microseconds per task
- 100-task deep chain: ~78 microseconds total
- 100 independent tasks: ~122 microseconds total
- 10,000 independent tasks: ~11.2 milliseconds total
Scaling: Sub-microsecond per-task overhead across all workload patterns. Linear scaling verified to 10k+ tasks.
Comparison: dagx is 4-148x faster than dagrs (v0.5) across all benchmark patterns (see detailed benchmarks section).
The primitives-as-scheduler approach with inline fast-path optimization delivers exceptional performance: coordination overhead is sub-microsecond per task, and for real-world workloads where tasks do meaningful work (I/O, computation, etc.), framework overhead is negligible—typically well under 1% of total execution time.
Quick Start
Add to your Cargo.toml:
[]
= "0.1"
= { = "1", = ["macros", "rt-multi-thread"] }
Basic example:
use ;
// Define tasks with the #[task] macro
;
;
async
Comparison with Similar Projects
The Rust ecosystem offers several DAG execution libraries, each optimized for different use cases. This comparison helps you choose the right tool for your needs.
Developer Experience: API Comparison
The same task (compute 2 + 3) implemented across different libraries shows how API complexity varies:
dagx (Simple: macro + builder):
let dag = new;
let x = dag.add_task;
let y = dag.add_task;
let sum = dag.add_task.depends_on;
dag.run.await?;
dagrs (Complex: traits + channels + IDs + manual wiring):
let mut table = new;
let node1 = with_action;
let id1 = node1.id;
let node2 = with_action;
let id2 = node2.id;
let node3 = with_action;
let id3 = node3.id;
let mut graph = new;
graph.add_node; graph.add_node; graph.add_node;
graph.add_edge; graph.add_edge;
graph.set_env; graph.start.unwrap;
async_dag (Medium: slots + indices):
let mut graph = new;
let x = graph.add_task;
let y = graph.add_task;
let sum = graph.add_child_task?;
graph.update_dependency?; // Must specify slot index
let sum = graph.add_child_task?;
graph.update_dependency?;
Key differences:
- dagx: Type-safe dependencies, automatic wiring, no manual ID tracking, minimal boilerplate
- dagrs: Manual channel management, node ID tracking, Content wrapping, Action trait boilerplate
- async_dag: Slot indices must be tracked manually, dependencies updated separately
Quick Comparison
| Project | License | Runtime | Type Safety | API Complexity | Performance vs dagx | Key Features |
|---|---|---|---|---|---|---|
| dagx | MIT | Any async runtime | Compile-time | Simple | Baseline (see benchmarks) | Primitives-as-scheduler, inline fast-path, automatic Arc wrapping, up to 8 deps/task |
| dagrs (v0.5) | MIT/Apache-2.0 | Creates own runtime | Runtime (async_trait) | Complex | 4-148x slower across all patterns | Flow-based Programming, cyclic graphs, loops, conditional nodes, YAML config |
| async_dag | MIT | Any async runtime | Compile-time | Medium | No benchmarks | Slot-based dependencies, Graph/TryGraph modes, maximum parallelism |
| dag-flow | MIT/Apache-2.0 | Any async runtime | Runtime (HashMap) | Complex | No benchmarks | Experimental, all tasks run simultaneously, weak dependencies |
| RenovZ/dag-runner | MIT | Tokio only | Unclear | Simple | No benchmarks | Edge-based API, cycle detection, stops on first error |
| tasksitter | Unspecified | Unclear | Unclear | Medium | No benchmarks | Cyclic graphs, dynamic runtime modification, pause/resume |
GitHub stars (as of 2025): dagrs (449), async_dag (25), dag-flow (2), others (0-2)
Performance Benchmarks vs dagrs
Direct comparison benchmarks (lower is better):
| Workload Pattern | dagx | dagrs (v0.5) | dagx Performance |
|---|---|---|---|
| Linear pipeline (5 sequential tasks) | 3.0 µs | 446.1 µs | 148x faster 🚀 |
| ETL pipeline (realistic extract-transform-load) | 26.2 µs | 460.4 µs | 17.6x faster ✓ |
| Deep chain (100 purely sequential tasks) | 78.2 µs | 801.5 µs | 10.2x faster ✓ |
| Wide fanout (1→100 broadcast) | 161.0 µs | 665.5 µs | 4.1x faster ✓ |
| Large scale (10,000 independent tasks) | 11.2 ms | 13.6 ms | 1.2x faster ✓ |
Summary: dagx is 4-148x faster than dagrs across all benchmark patterns. The inline fast-path optimization eliminates spawning overhead for sequential workloads while maintaining excellent parallel performance.
How does dagx achieve this? The inline fast-path detects single-task layers (common in sequential chains) and executes them directly without spawning overhead. For parallel workloads, tasks still spawn to maximize concurrency. This adaptive execution strategy combines the best of both worlds: sub-microsecond overhead for sequential work, true parallelism for concurrent work.
Key insight: Most real-world DAGs mix sequential and parallel patterns. dagx automatically optimizes for both, delivering 4-148x better performance than dagrs regardless of workload shape.
Benchmarks run on AMD Ryzen 7 7840U (Zen 4) @ 3.3GHz. Run cargo bench to test on your hardware.
Detailed Comparison
dagrs (Most Mature)
Best for: Complex workflows requiring advanced flow control, Tokio-based applications, machine learning pipelines.
Strengths:
- Most mature (449 GitHub stars, active community)
- Rich feature set: Flow-based Programming, cyclic graphs, loops, conditional nodes
- YAML configuration support for declarative workflows
- Designed for complex orchestration patterns
Trade-offs:
- Creates own Tokio runtime internally (not runtime-agnostic, cannot be nested)
- More complex API:
Actiontrait,InChannels/OutChannels,NodeTable,Contentwrappers, manual node ID tracking - Uses
async_traitfor type erasure (runtime overhead) - 4-148x slower than dagx across all benchmark patterns (see comparison benchmarks)
API Style (v0.5):
// Define action with channels for communication
let mut table = new;
let node = with_action;
let node_id = node.id; // Must capture ID before moving
graph.add_node;
graph.add_edge;
graph.set_env; graph.start.unwrap;
async_dag (Clean Type Safety)
Best for: Runtime flexibility with compile-time type safety, fail-fast workflows.
Strengths:
- Runtime-agnostic (works with any async runtime)
- Compile-time type checking on task connections
- Both standard (
Graph) and fail-fast (TryGraph) modes - Designed for maximum parallelism
Trade-offs:
- Medium API complexity: slot-based dependency management
- Must manually specify slot indices (0, 1, etc.) when connecting tasks
- Less mature (25 stars)
- No performance benchmarks
API Style:
let mut graph = new;
let _1 = graph.add_task;
let _2 = graph.add_task;
// add_child_task with slot index
let _3 = graph.add_child_task.unwrap;
graph.update_dependency.unwrap; // Specify slot 1
graph.run.await;
dag-flow (Experimental)
Best for: Experimental projects, flexible dependency awaiting patterns.
Strengths:
- Runtime-agnostic
- All tasks run simultaneously (not in dependency layers)
- Weak dependencies support
- Flexible input awaiting at any point in task execution
Trade-offs:
- Explicitly experimental
- Runtime type safety via
HashMap<String, Input> - Complex API: implement
Tasktrait withid(),dependencies(),run() - Named dependencies (string-based lookup)
- Very early stage (2 stars)
API Style:
RenovZ/dag-runner (Simple Edge-Based)
Best for: Simple DAGs with Tokio, straightforward edge-based dependencies.
Strengths:
- Simple API:
add_vertex(),add_edge() - Cycle detection
- Stops on first error
Trade-offs:
- Requires Tokio runtime
- Manual channel setup for task communication
- Type safety model unclear
- Very early stage (0 stars)
API Style:
let mut dag = default;
dag.add_vertex;
dag.add_edge;
dag.run.await?;
tasksitter (Dynamic Workflows)
Best for: Dynamic workflow modification, cyclic graphs, runtime introspection.
Strengths:
- Supports cyclic graphs (not just DAGs)
- Dynamic graph modification at runtime
- Pause/resume capabilities
- Graph introspection
Trade-offs:
- Limited documentation
- Runtime and type safety model unclear
- Very early stage (0 stars)
When to Choose dagx
Choose dagx when you value:
- Performance: 4-148x faster than dagrs across all workload patterns (see benchmarks)
- Runtime flexibility: Works with Tokio, async-std, smol, or any async runtime
- Compile-time safety: Full type safety with no runtime type errors in the public API
- Minimal overhead: ~0.78µs per task (sequential), ~1.2µs per task (parallel)
- Simple, ergonomic API:
#[task]macro,add_task(),depends_on()- that's it - Automatic optimizations: Arc wrapping, inline execution, adaptive spawning - all transparent
- Predictable performance: Linear scaling, no hidden complexity, consistent sub-µs overhead
dagx is not the right choice if you need:
- Cyclic graphs or dynamic flow control (loops, conditions) → Consider dagrs or tasksitter
- More than 8 dependencies per task → Consider dagrs or async_dag
- YAML-based configuration → Consider dagrs
When to Consider Alternatives
- Choose dagrs if you need advanced flow control (loops, conditionals, cyclic graphs), YAML configuration, or are already committed to Tokio and want a mature, feature-rich solution
- Choose async_dag if you want compile-time type safety with runtime flexibility and the slot-based API appeals to you
- Choose dag-flow if you're building experimental projects and the all-tasks-run-simultaneously model fits your use case
- Choose RenovZ/dag-runner if you need the simplest possible edge-based API and are already using Tokio
- Choose tasksitter if you need dynamic graph modification at runtime or cyclic workflow support
Core Concepts
Task
A Task is a unit of async work with typed inputs and outputs. Use the #[task] macro to define tasks:
use ;
;
Task Patterns
dagx supports three task patterns:
1. Stateless - Pure functions with no state:
;
2. Read-only state - Configuration accessed via &self:
;
3. Mutable state - State modification via &mut self:
;
DagRunner
The DagRunner orchestrates task execution:
let dag = new;
let handle = dag.add_task;
TaskHandle
A TaskHandle<T> is a typed reference to a task's output. Use it to wire dependencies and retrieve results:
// Single dependency
let task = dag.add_task.depends_on;
// Multiple dependencies (order matters!)
let task = dag.add_task.depends_on;
Examples
Fan-out Pattern (1 → n)
One task produces a value consumed by multiple downstream tasks:
use ;
;
;
;
async
Fan-in Pattern (m → 1)
Multiple tasks produce values consumed by a single downstream task:
use ;
;
;
;
;
async
Many-to-Many Pattern (m ↔ n)
Complex DAGs with multiple layers:
use ;
;
;
;
async
Type-State Pattern
The API uses the type-state pattern to ensure correctness at compile time:
use ;
;
;
// ✅ This compiles
let task = dag.add_task.depends_on; // upstream produces i32
// ❌ This doesn't compile - type mismatch!
let task = dag.add_task.depends_on; // Error: expected String, got i32
Runtime Agnostic
dagx works with any async runtime. Provide a spawner function to run():
// With Tokio
dag.run.await.unwrap;
// With async-std
dag.run.await.unwrap;
// With smol
dag.run.await.unwrap;
Important Limitations
Tasks Cannot Return Bare Tuples
Tasks cannot return bare tuples as output types. This is a technical limitation of the current implementation. If you need to return multiple values from a task, use one of these workarounds:
Option 1: Use a struct (recommended)
;
Option 2: Wrap in Result
;
Structs are preferred because they're self-documenting and easier to refactor.
Examples
The examples/ directory contains both tutorial examples (numbered, beginner-friendly) and reference examples (practical patterns).
Tutorial Examples
Start here if you're new to dagx. These examples build progressively:
01_basic.rs- Getting started: your first DAG02_fan_out.rs- Fan-out pattern: one task feeds many (1 → N)03_fan_in.rs- Fan-in pattern: many tasks feed one (N → 1)04_parallel_computation.rs- Map-reduce pattern with true parallelism
Run tutorial examples:
Reference Examples
Practical patterns for real-world use cases:
complex_dag.rs- Multi-layer workflows with complex dependenciesconditional_workflow.rs- Conditional execution with Result typesdata_pipeline.rs- Real-world ETL data processing pipelineerror_handling.rs- Error propagation and recovery strategiestimeout.rs- Managing task timeoutslarge_dag.rs- Performance and scalability (10,000+ nodes)parallelism_proof.rs- Proof of true parallel execution (10,000 tasks × 1s = 1s total)
Run reference examples:
When to Use dagx
dagx is ideal for:
- Data pipelines with complex dependencies between stages
- Build systems where tasks depend on outputs of other tasks
- Parallel computation where work can be split and aggregated
- Workflow engines with typed data flow between stages
- ETL processes with validation and transformation steps
Performance
dagx provides true parallel execution with sub-microsecond overhead:
- Adaptive execution: Inline fast-path for sequential work, true parallelism for concurrent work
- Tasks spawn to multiple threads via your async runtime's spawner when beneficial
- Linear scaling verified up to 10,000+ tasks
- ~0.78µs overhead per task for sequential workloads (inline execution)
- ~1.2µs overhead per task for parallel workloads
- Efficient memory usage at ~200 bytes per task
- Zero-cost abstractions through generics and compile-time monomorphization
- 4-148x faster than dagrs across all benchmark patterns (see comparison benchmarks above)
Performance Characteristics
- ✅ Exceptional for sequential chains: 10-148x faster than dagrs via inline execution
- ✅ Excellent for parallel workloads: 4-10x faster than dagrs on fan-out, ETL, large-scale
- ✅ Best-in-class for mixed parallelism: Automatically optimizes execution strategy per layer
- ✅ Sub-microsecond per-task overhead: Fast enough that framework cost is negligible
Result: dagx dominates dagrs on every benchmark pattern. No trade-offs, no compromises.
Automatic Arc Wrapping (No Manual Arc Needed!)
Task outputs are automatically wrapped in Arc<T> internally for efficient fan-out patterns. You just output T - the framework handles the Arc wrapping:
use ;
// ✅ CORRECT: Just output Vec<String>, framework handles Arc internally
;
// Downstream tasks receive &Vec<String>
// Internally, Arc<Vec<String>> is cloned (cheap), then inner Vec is extracted
;
How it works:
- Your task outputs
T - Framework wraps it in
Arc<T>internally - For fan-out (1→N), Arc is cloned N times (just pointer copies - O(1))
- Each downstream task receives
&Tafter extracting from Arc
Performance characteristics:
- Heap types (Vec, String, HashMap): Arc overhead is negligible, fanout is essentially free
- Copy types (i32, usize): Small Arc overhead (~few ns) due to atomic refcounting
- See
cargo benchfor actual measurements
Advanced - Zero-copy optimization:
If you want true zero-copy sharing (no extraction), output Arc<T> explicitly:
// Your task outputs Arc<T>
async
// Downstream receives &Arc<T> - just clones the Arc pointer
async
This becomes Arc<Arc<T>> internally, but ExtractInput unwraps one layer automatically.
Run benchmarks:
View the detailed HTML reports:
# macOS
# Linux
# Windows
# Or manually open target/criterion/report/index.html in your browser
Documentation
Full API documentation is available at docs.rs/dagx.
License
Licensed under the MIT License. See LICENSE for details.
Copyright (c) 2025 Stephen Waits steve@waits.net
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
For security issues, see SECURITY.md.