nano-watchdog
OS-thread watchdog for stuck task detection with RAII guards -- works even when Tokio is frozen.
Why
When your async runtime deadlocks, everything running inside it freezes -- including your health checks, metrics exporters, and monitoring tasks. You only find out when an external probe times out or a user complains.
nano-watchdog solves this by running a dedicated OS thread that is completely
independent of Tokio (or any async runtime). At every check interval it scans your
tasks for progress and warns about any task that has been running longer than the
stuck threshold. Because it never touches the async runtime, it stays alive even
during a full executor deadlock.
How It Works
- You create a TaskTracker and store it in a static (or leak a Box).
- You call start_watchdog(tracker, config, callback) -- this spawns a plain std::thread named "watchdog".
- Before each unit of work, call tracker.track("description") to get an RAII TaskGuard.
- Optionally call guard.set_phase("parsing") to annotate progress within a task.
- When the guard is dropped the task is automatically unregistered and the completed counter increments.
- The watchdog thread wakes at check_interval, scans for tasks exceeding stuck_threshold, and logs them via tracing::warn!.
- If no tasks complete for no_progress_alarm_count consecutive checks, the watchdog logs a deadlock warning and dumps all active tasks.
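The guard lifecycle in the steps above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: SketchTracker and SketchGuard are hypothetical stand-ins, and a Mutex<HashMap> replaces whatever storage the real TaskTracker uses.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::Instant;

/// Hypothetical stand-in for the crate's tracker (std only).
pub struct SketchTracker {
    next_id: AtomicU64,
    completed: AtomicU64,
    // task id -> (description, start time)
    active: Mutex<HashMap<u64, (String, Instant)>>,
}

/// RAII guard: the task stays registered for as long as this value lives.
pub struct SketchGuard<'a> {
    tracker: &'a SketchTracker,
    id: u64,
}

impl SketchTracker {
    pub fn new() -> Self {
        Self {
            next_id: AtomicU64::new(1),
            completed: AtomicU64::new(0),
            active: Mutex::new(HashMap::new()),
        }
    }

    /// Registers a task and hands back the guard that will unregister it.
    pub fn track(&self, description: &str) -> SketchGuard<'_> {
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        self.active
            .lock()
            .unwrap()
            .insert(id, (description.to_string(), Instant::now()));
        SketchGuard { tracker: self, id }
    }

    pub fn active_count(&self) -> usize {
        self.active.lock().unwrap().len()
    }

    pub fn completed_count(&self) -> u64 {
        self.completed.load(Ordering::Relaxed)
    }
}

impl Drop for SketchGuard<'_> {
    fn drop(&mut self) {
        // Unregister the task and count it as completed.
        self.tracker.active.lock().unwrap().remove(&self.id);
        self.tracker.completed.fetch_add(1, Ordering::Relaxed);
    }
}
```

Holding the guard in a local variable ties the registration to the scope of the work: when the scope ends, the active count drops by one and the completed count rises by one.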
Installation
[dependencies]
nano-watchdog = "0.1"
Quick Start
use ;
Usage with Tokio
Because TaskTracker is Send + Sync and requires only &self, it works
seamlessly from async code. Store it in a static so every task can reach it:
use ;
use LazyLock;
static TRACKER: = new;
async
Custom Callbacks
Pass an on_stuck closure to start_watchdog to integrate with your alerting
pipeline. The closure receives a slice of StuckTaskInfo each time stuck tasks
are detected:
use ;
use ;
use Arc;
use Duration;
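As a rough illustration of what such a closure might do, the sketch below counts alerts and formats one line per stuck task. StuckInfoSketch is a hypothetical mirror of the documented StuckTaskInfo fields, and format_alert and make_alert_callback are illustrative helpers, not part of the crate; the exact closure bounds that start_watchdog expects are an assumption here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;

/// Hypothetical mirror of the documented StuckTaskInfo fields.
#[derive(Debug, Clone)]
pub struct StuckInfoSketch {
    pub task_id: u64,
    pub description: String,
    pub phase: String,
    pub duration: Duration,
}

/// Formats one alert line for a stuck task.
pub fn format_alert(task: &StuckInfoSketch) -> String {
    format!(
        "task {} ({}) stuck in phase '{}' for {} ms",
        task.task_id,
        task.description,
        task.phase,
        task.duration.as_millis()
    )
}

/// Builds a callback that reports every stuck task and counts the alerts.
pub fn make_alert_callback(
    alerts: Arc<AtomicUsize>,
) -> impl Fn(&[StuckInfoSketch]) + Send + 'static {
    move |stuck: &[StuckInfoSketch]| {
        for task in stuck {
            // Swap eprintln! for a call into a real alerting pipeline.
            eprintln!("{}", format_alert(task));
        }
        alerts.fetch_add(stuck.len(), Ordering::Relaxed);
    }
}
```

Sharing state through an Arc keeps the closure 'static and Send, which is what a callback invoked from a separate OS thread generally needs.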
Multi-threaded Usage
TaskTracker uses DashMap internally. DashMap shards its entries behind per-shard
locks, so concurrent tracking avoids a single global lock and is safe to call
from any number of threads simultaneously:
use ;
use Duration;
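The concurrent pattern can be sketched with scoped threads from the standard library. CounterTracker and CounterGuard below are hypothetical stand-ins that use plain atomics rather than the crate's DashMap-based tracker; the point is only that guards created and dropped on many threads leave the counters consistent.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::thread;

/// Hypothetical stand-in counters (not the crate's tracker).
#[derive(Default)]
pub struct CounterTracker {
    active: AtomicUsize,
    completed: AtomicU64,
}

/// Guard that decrements `active` and increments `completed` on drop.
pub struct CounterGuard<'a>(&'a CounterTracker);

impl CounterTracker {
    pub fn track(&self) -> CounterGuard<'_> {
        self.active.fetch_add(1, Ordering::SeqCst);
        CounterGuard(self)
    }
}

impl Drop for CounterGuard<'_> {
    fn drop(&mut self) {
        self.0.active.fetch_sub(1, Ordering::SeqCst);
        self.0.completed.fetch_add(1, Ordering::SeqCst);
    }
}

/// Runs `threads` workers that each track `tasks_per_thread` tasks and
/// returns the final (active, completed) counts.
pub fn run_workers(threads: usize, tasks_per_thread: usize) -> (usize, u64) {
    let tracker = CounterTracker::default();
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..tasks_per_thread {
                    // The guard is dropped at the end of each iteration.
                    let _guard = tracker.track();
                }
            });
        }
    });
    (
        tracker.active.load(Ordering::SeqCst),
        tracker.completed.load(Ordering::SeqCst),
    )
}
```

Calling run_workers(8, 50) should leave no active tasks and 400 completed ones, since every guard is dropped before the scope joins its threads.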
Configuration
WatchdogConfig controls the watchdog thread's behavior:
| Field | Type | Default | Description |
|---|---|---|---|
| stuck_threshold | Duration | 5 s | How long a task must run before it is considered stuck. |
| check_interval | Duration | 1 s | How often the watchdog thread wakes to scan for stuck tasks. |
| no_progress_alarm_count | u32 | 5 | Consecutive zero-progress checks before a deadlock warning fires. |
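To make the interplay of these fields concrete, here is a hedged sketch of the defaults and of the no-progress counter. SketchConfig and ProgressMonitor are hypothetical names, not crate types, and the choice to treat an idle tracker (no active tasks) as "not stalled" is an assumption of this sketch rather than documented behavior.

```rust
use std::time::Duration;

/// Hypothetical mirror of the documented configuration fields.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchConfig {
    pub stuck_threshold: Duration,
    pub check_interval: Duration,
    pub no_progress_alarm_count: u32,
}

impl Default for SketchConfig {
    fn default() -> Self {
        // Values taken from the table above.
        Self {
            stuck_threshold: Duration::from_secs(5),
            check_interval: Duration::from_secs(1),
            no_progress_alarm_count: 5,
        }
    }
}

/// Tracks how many consecutive checks saw no newly completed task.
pub struct ProgressMonitor {
    last_completed: u64,
    stalled_checks: u32,
}

impl ProgressMonitor {
    pub fn new() -> Self {
        Self { last_completed: 0, stalled_checks: 0 }
    }

    /// Call once per check; returns true when the alarm should fire.
    pub fn observe(&mut self, completed: u64, active: usize, cfg: &SketchConfig) -> bool {
        // Assumption: progress, or nothing in flight, resets the counter.
        if completed != self.last_completed || active == 0 {
            self.last_completed = completed;
            self.stalled_checks = 0;
            return false;
        }
        self.stalled_checks += 1;
        self.stalled_checks >= cfg.no_progress_alarm_count
    }
}
```

With the defaults, five consecutive one-second checks without a completed task would raise the alarm, so a full stall is reported after roughly five seconds.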
API Reference
TaskTracker
| Method | Returns | Description |
|---|---|---|
| TaskTracker::new() | TaskTracker | Create a new tracker. |
| track(&self, description: &str) | TaskGuard<'_> | Start tracking a task. Returns an RAII guard. |
| active_count(&self) | usize | Number of currently active (in-flight) tasks. |
| completed_count(&self) | u64 | Total tasks completed since creation. |
| check_stuck_tasks(&self, max_duration: Duration) | Vec<StuckTaskInfo> | Find tasks running longer than max_duration. |
| dump_all_tasks(&self) | String | Formatted dump of all active tasks for diagnostics. |
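The stuck-task scan behind check_stuck_tasks can be pictured as a filter over start times. The sketch below is an assumption about the general technique, not the crate's code: ActiveTask and find_stuck are hypothetical, the current time is passed in so the function is deterministic, and sorting longest-first is a choice made here for readability.

```rust
use std::time::{Duration, Instant};

/// Hypothetical snapshot of one in-flight task.
pub struct ActiveTask {
    pub task_id: u64,
    pub description: String,
    pub started: Instant,
}

/// Returns (task_id, elapsed) for every task that has been running
/// strictly longer than `max_duration` as of `now`, longest first.
pub fn find_stuck(
    tasks: &[ActiveTask],
    now: Instant,
    max_duration: Duration,
) -> Vec<(u64, Duration)> {
    let mut stuck: Vec<(u64, Duration)> = tasks
        .iter()
        .filter_map(|task| {
            // saturating_duration_since avoids a panic if `started` is after `now`.
            let elapsed = now.saturating_duration_since(task.started);
            (elapsed > max_duration).then_some((task.task_id, elapsed))
        })
        .collect();
    stuck.sort_by(|a, b| b.1.cmp(&a.1));
    stuck
}
```

Passing Instant::now() as `now` gives the live behavior, while a fixed instant makes the scan easy to unit test.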
TaskGuard<'a>
| Method | Returns | Description |
|---|---|---|
| set_phase(&self, phase: &str) | () | Update the current phase annotation of this task. |
| id(&self) | u64 | Get the unique task ID. |
| (drop) | -- | Automatically unregisters the task on drop. |
StuckTaskInfo
| Field | Type | Description |
|---|---|---|
| task_id | u64 | Unique identifier of the stuck task. |
| description | String | The description passed to track(). |
| phase | String | Last phase set via set_phase(). |
| duration | Duration | How long the task has been running. |
start_watchdog
Spawns the watchdog on a dedicated OS thread named "watchdog". Safe to call
multiple times -- subsequent calls are no-ops.
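One common way to get this "first call wins" behavior is an atomic flag checked before spawning. The sketch below illustrates that technique under stated assumptions: start_once and STARTED are hypothetical names, and the crate may well use a different mechanism (for example std::sync::Once).

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

/// Set to true by the first successful start.
static STARTED: AtomicBool = AtomicBool::new(false);

/// Spawns the background thread on the first call and returns true;
/// later calls are no-ops and return false.
pub fn start_once(check_interval: Duration) -> bool {
    // swap returns the previous value, so only one caller ever sees `false`.
    if STARTED.swap(true, Ordering::SeqCst) {
        return false;
    }
    thread::Builder::new()
        .name("watchdog".to_string())
        .spawn(move || {
            while STARTED.load(Ordering::SeqCst) {
                thread::sleep(check_interval);
                // A real watchdog would scan for stuck tasks here.
            }
        })
        .expect("failed to spawn watchdog thread");
    true
}
```

Because the flag is swapped atomically, two threads racing to start the watchdog cannot both spawn it.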
Examples
The crate ships with three runnable examples:
# Basic stuck-task detection
# Custom callback for alerting
# Concurrent tracking from 8 threads x 50 tasks
Design Principles
- No Tokio dependency -- runs on a plain std::thread, so it survives async runtime freezes.
- Zero-cost when idle -- the watchdog thread sleeps between checks; no busy-polling.
- RAII safety -- TaskGuard ensures tasks are always unregistered, even on panic unwinds.
- Minimal API surface -- five public types, one function. Learn it in minutes.
- Observable -- integrates with tracing for structured logging; optional callback for custom alerting.
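The RAII point relies on a general Rust guarantee: destructors run while a panic unwinds the stack. The sketch below demonstrates that guarantee with a hypothetical ActiveGuard over a plain atomic counter; it is not the crate's TaskGuard, and it assumes the default panic = "unwind" setting (with panic = "abort" no destructors run).

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical guard that keeps an "active tasks" counter in sync.
pub struct ActiveGuard<'a>(&'a AtomicUsize);

impl<'a> ActiveGuard<'a> {
    pub fn new(active: &'a AtomicUsize) -> Self {
        active.fetch_add(1, Ordering::SeqCst);
        ActiveGuard(active)
    }
}

impl Drop for ActiveGuard<'_> {
    fn drop(&mut self) {
        // Runs on normal scope exit and during panic unwinding alike.
        self.0.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Runs a task that panics while holding a guard and returns the
/// active count observed afterwards.
pub fn active_after_panic() -> usize {
    let active = AtomicUsize::new(0);
    let result = panic::catch_unwind(AssertUnwindSafe(|| {
        let _guard = ActiveGuard::new(&active);
        panic!("simulated task failure");
    }));
    assert!(result.is_err());
    active.load(Ordering::SeqCst)
}
```

The counter returns to zero even though the task never reached the end of its scope, which is why a panicking task does not linger as a false "stuck" entry.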
License
MIT