Install
[]
= "0.0"
With features:
[]
= { = "0.0", = ["controller", "logging"] }
See crates.io for the latest version.
Taskvisor
Lightweight, event-driven task supervision for async Rust.
A small orchestrator that runs your background tasks, restarts them on failure, and tells you what happened.
┌────────────┐ runs & restarts ┌──────────────┐
│ Tasks │ <───────────────────── │ Supervisor │
└────────────┘ └──────┬───────┘
|
emits events
|
┌──────────────┐
│ Subscribers │ <─ your metrics / logs
└──────────────┘
Use it for long-lived jobs, background workers, or anything that must stay alive, observable, and restart safely.
Quick start
A task that prints "pong" every 10 seconds, restarts forever, and shuts down on Ctrl+C:
use Duration;
use CancellationToken;
use ;
async
Core concepts
Task & TaskFn — A Task is any Send + Sync + 'static type that implements fn spawn(&self, ctx: CancellationToken) -> BoxTaskFuture. TaskFn wraps a closure into a Task so you don't need a struct. TaskRef is just Arc<dyn Task>.
TaskSpec — Bundles a task with its policies: restart, backoff, and optional timeout. This is what you pass to the supervisor.
RestartPolicy — Controls when a task restarts after it exits:
| Policy | Behavior |
|---|---|
Never |
Run once, done. |
OnFailure (default) |
Restart only on error. Success = stop. |
Always { interval } |
Always restart. interval: Some(10s) waits between successes. |
BackoffPolicy — Controls retry delay after failure. Delay for attempt n = first * factor^n, capped at max, then jitter is applied:
| Field | Default | Meaning |
|---|---|---|
first |
100ms |
Initial delay |
max |
30s |
Delay cap |
factor |
1.0 |
Multiplier per attempt (2.0 = exponential) |
jitter |
None |
Randomization strategy (see below) |
JitterPolicy — Prevents thundering herd when multiple tasks retry at the same time:
| Policy | Range | Use when |
|---|---|---|
None |
exact delay | Single task, predictable timing |
Full |
[0, delay] |
Maximum spread needed |
Equal (recommended) |
[delay/2, delay] |
Balanced: preserves ~75% of backoff |
Decorrelated |
[base, base*3] capped at max |
Sophisticated, self-adjusting |
Supervisor — Owns the runtime. Spawns task actors, distributes events to subscribers, handles shutdown.
Events & Subscribe — Every lifecycle change (start, stop, fail, timeout, backoff, shutdown) is published to a broadcast bus. Implement Subscribe to observe them. Each subscriber gets its own queue — a slow subscriber never blocks others.
Recipes
Exponential backoff with jitter
use Duration;
use ;
let backoff = BackoffPolicy ;
Per-task timeout
// Task gets 5s per attempt. If exceeded: TimeoutHit event + TaskError::Timeout + retry.
let spec = new;
Custom subscriber (metrics)
use ;
use Arc;
use async_trait;
use ;
// Pass to supervisor:
let metrics = new;
let sup = new;
Periodic task (run every N seconds)
// Runs, completes, waits 30s, runs again. Forever.
let spec = new;
Runtime task management
// Add/remove/cancel while running
sup.add_task.await?;
sup.cancel.await?;
sup.remove_task.await?;
let alive = sup.is_alive.await;
let tasks = sup.list_tasks.await;
Error handling
Return these from your task to control what happens next:
| Return | Retryable | What happens |
|---|---|---|
Ok(()) |
- | Task completed. RestartPolicy decides next step. |
Err(TaskError::Fail { reason }) |
Yes | Retryable failure. Backoff, then retry. |
Err(TaskError::Timeout { .. }) |
Yes | Set automatically when per-task timeout is exceeded. |
Err(TaskError::Fatal { reason }) |
No | Permanent failure. Actor stops, publishes ActorDead. |
Err(TaskError::Canceled) |
No | Graceful shutdown. Not an error. |
Cancellation pattern
Tasks must observe cancellation. Two equivalent patterns:
// Pattern 1: select! (recommended for long-running tasks)
select!
// Pattern 2: check before work (ok for short tasks)
if ctx.is_cancelled
do_work.await
Events
Every lifecycle change publishes an Event to the bus. Subscribe to observe them.
| Event | Meaning |
|---|---|
| Lifecycle | |
TaskStarting |
Attempt is beginning |
TaskStopped |
Completed or cancelled gracefully |
TaskFailed |
Attempt failed (retryable or fatal) |
TimeoutHit |
Attempt exceeded its timeout |
BackoffScheduled |
Retry delay scheduled (includes delay duration) |
| Terminal | |
ActorExhausted |
Restart policy says stop (normal end-of-life) |
ActorDead |
Fatal error, actor will not restart |
| Management | |
TaskAdded / TaskRemoved |
Task registered / unregistered |
TaskAddRequested / TaskRemoveRequested |
Add/remove commands received |
| Shutdown | |
ShutdownRequested |
OS signal caught (SIGTERM/SIGINT) |
AllStoppedWithinGrace |
Clean shutdown |
GraceExceeded |
Some tasks didn't stop in time |
| Subscriber | |
SubscriberPanicked |
A subscriber panicked (isolated, others unaffected) |
SubscriberOverflow |
Subscriber queue full, event dropped |
Each event carries: kind, task (name), attempt, reason, delay_ms, timeout_ms, seq (monotonic ordering), at (timestamp).
Configuration
use SupervisorConfig;
let mut cfg = default;
cfg.grace = from_secs; // shutdown grace period
cfg.timeout = from_secs; // default per-task timeout (0 = none)
cfg.max_concurrent = 4; // task concurrency limit (0 = unlimited)
cfg.bus_capacity = 2048; // event bus ring buffer size
cfg.restart = OnFailure;
cfg.backoff = default;
| Field | Default | Meaning |
|---|---|---|
grace |
60s |
How long to wait for tasks to stop on shutdown |
timeout |
0s (none) |
Default per-task attempt timeout |
max_concurrent |
0 (unlimited) |
Global semaphore for running tasks |
bus_capacity |
1024 |
Broadcast channel size. Slow subscribers see Lagged |
restart |
OnFailure |
Default restart policy for tasks |
backoff |
100ms / 1.0x / 30s max |
Default backoff for tasks |
Controller (feature = controller)
Slot-based admission control. Tasks submit to named slots; the policy decides what happens when a slot is busy.
| Policy | Behavior |
|---|---|
Queue |
FIFO queue. New task waits until current one finishes. |
Replace |
Cancels running task, starts the new one immediately. |
DropIfRunning |
Rejects submission if slot is already busy. |
use ;
let sup = builder
.with_controller
.build;
// Submit with admission policy
sup.submit.await?;
sup.submit.await?;
sup.submit.await?;
How it works
TaskFn (your async code)
|
v
TaskSpec (task + RestartPolicy + BackoffPolicy + timeout)
|
v
Supervisor
├── Registry
│ └── TaskActor (per task)
│ ├── attempt loop
│ │ ├── run task with timeout
│ │ ├── apply RestartPolicy
│ │ └── apply BackoffPolicy on failure
│ └── publish events to Bus
└── Bus
└── Subscribers (your metrics, logs, alerts)
Examples
| Example | What it shows |
|---|---|
| subscriber.rs | Custom event subscriber tracking task metrics |
| control.rs | Add, remove, cancel tasks at runtime |
| controller.rs | Admission policies: Queue, Replace, DropIfRunning |
Optional features
| Feature | What it enables |
|---|---|
controller |
Controller, ControllerSpec, ControllerConfig, AdmissionPolicy |
logging |
Built-in LogWriter subscriber that prints events to stdout (demo only) |
Contributing
Found a bug? Have an idea? Pull requests and issues are welcome.