nano-watchdog 0.1.0

Owner: aovestdipaperino

nano-watchdog

OS-thread watchdog for stuck task detection with RAII guards -- works even when Tokio is frozen.

Why

When your async runtime deadlocks, everything running inside it freezes -- including your health checks, metrics exporters, and monitoring tasks. You only find out when an external probe times out or a user complains.

nano-watchdog solves this by running a dedicated OS thread that is completely independent of Tokio (or any async runtime). It keeps checking whether your tasks are making progress, and fires warnings the moment something gets stuck. Because it never touches the async runtime, it stays alive even during a full executor deadlock.

How It Works

  1. You create a TaskTracker and store it in a static (or leak a Box).
  2. You call start_watchdog(tracker, config, callback) -- this spawns a plain std::thread named "watchdog".
  3. Before each unit of work, call tracker.track("description") to get an RAII TaskGuard.
  4. Optionally call guard.set_phase("parsing") to annotate progress within a task.
  5. When the guard is dropped, the task is automatically unregistered and the completed counter is incremented.
  6. The watchdog thread wakes at check_interval, scans for tasks exceeding stuck_threshold, and logs them via tracing::warn!.
  7. If no tasks complete for no_progress_alarm_count consecutive checks, the watchdog logs a deadlock warning and dumps all active tasks.
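The register-on-track, unregister-on-drop flow in the steps above can be sketched with the standard library alone. This is a simplified illustration, not the crate's implementation: MiniTracker, MiniGuard, and stuck are hypothetical names, and a Mutex<HashMap> stands in for whatever storage the crate actually uses.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a tracker (NOT the crate's real type).
pub struct MiniTracker {
    next_id: AtomicU64,
    completed: AtomicU64,
    // task id -> (description, start time)
    active: Mutex<HashMap<u64, (String, Instant)>>,
}

/// Hypothetical RAII guard: dropping it unregisters the task.
pub struct MiniGuard<'a> {
    tracker: &'a MiniTracker,
    id: u64,
}

impl MiniTracker {
    pub fn new() -> Self {
        MiniTracker {
            next_id: AtomicU64::new(0),
            completed: AtomicU64::new(0),
            active: Mutex::new(HashMap::new()),
        }
    }

    /// Registers a task and returns the guard that will unregister it.
    pub fn track(&self, description: &str) -> MiniGuard<'_> {
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        self.active
            .lock()
            .unwrap()
            .insert(id, (description.to_string(), Instant::now()));
        MiniGuard { tracker: self, id }
    }

    pub fn active_count(&self) -> usize {
        self.active.lock().unwrap().len()
    }

    pub fn completed_count(&self) -> u64 {
        self.completed.load(Ordering::Relaxed)
    }

    /// Descriptions of tasks that have been running for at least `threshold`.
    pub fn stuck(&self, threshold: Duration) -> Vec<String> {
        self.active
            .lock()
            .unwrap()
            .values()
            .filter(|(_, started)| started.elapsed() >= threshold)
            .map(|(description, _)| description.clone())
            .collect()
    }
}

impl Drop for MiniGuard<'_> {
    fn drop(&mut self) {
        // Unregister the task, then count it as completed.
        self.tracker.active.lock().unwrap().remove(&self.id);
        self.tracker.completed.fetch_add(1, Ordering::Relaxed);
    }
}
```

A watchdog thread would call something like stuck(threshold) on every wake-up. In this sketch a task counts as active only while its guard is alive.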

Installation

[dependencies]
nano-watchdog = "0.1"

Quick Start

use nano_watchdog::{TaskTracker, WatchdogConfig, start_watchdog};

fn main() {
    // Create a global tracker (leaked so it lives for 'static)
    let tracker: &'static TaskTracker = Box::leak(Box::new(TaskTracker::new()));

    // Start the watchdog thread
    start_watchdog(tracker, WatchdogConfig::default(), None);

    // Track a unit of work with an RAII guard
    {
        let guard = tracker.track("process request");
        guard.set_phase("parsing");
        // ... do work ...
        guard.set_phase("responding");
    } // guard dropped -> task automatically unregistered
}

Usage with Tokio

Because TaskTracker is Send + Sync and requires only &self, it works seamlessly from async code. Store it in a static so every task can reach it:

use nano_watchdog::{TaskTracker, WatchdogConfig, start_watchdog};
use std::sync::LazyLock; // requires Rust 1.80 or newer

static TRACKER: LazyLock<&'static TaskTracker> = LazyLock::new(|| {
    let tracker: &'static TaskTracker = Box::leak(Box::new(TaskTracker::new()));
    start_watchdog(tracker, WatchdogConfig::default(), None);
    tracker
});

#[tokio::main]
async fn main() {
    let guard = TRACKER.track("handle /api/transfer");
    guard.set_phase("validating");
    // ... await something ...
    guard.set_phase("committing");
    // guard drops here, when the future completes (or earlier, if the future is cancelled)
}

Custom Callbacks

Pass an on_stuck closure to start_watchdog to integrate with your alerting pipeline. The closure receives a slice of StuckTaskInfo each time stuck tasks are detected:

use nano_watchdog::{TaskTracker, WatchdogConfig, start_watchdog};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;

fn main() {
    let tracker: &'static TaskTracker = Box::leak(Box::new(TaskTracker::new()));
    let alert_count = Arc::new(AtomicUsize::new(0));
    let counter = alert_count.clone();

    start_watchdog(
        tracker,
        WatchdogConfig {
            stuck_threshold: Duration::from_secs(1),
            check_interval: Duration::from_millis(500),
            ..Default::default()
        },
        Some(Box::new(move |stuck_tasks| {
            counter.fetch_add(1, Ordering::Relaxed);
            for task in stuck_tasks {
                eprintln!(
                    "[ALERT] \"{}\" phase={} stuck for {:.1}s",
                    task.description, task.phase, task.duration.as_secs_f64()
                );
            }
        })),
    );

    let _guard = tracker.track("POST /api/transfer");
    std::thread::sleep(Duration::from_secs(3));

    println!("Alerts fired: {}", alert_count.load(Ordering::Relaxed));
}
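One way to connect such a callback to an alerting pipeline is to forward each report over a channel, so the watchdog thread never blocks on slow I/O. The sketch below is std-only and makes assumptions: Info is a hypothetical mirror of the documented StuckTaskInfo fields, OnStuck copies the callback type from the start_watchdog signature, and the callback is invoked by hand from a spawned thread rather than by the real watchdog.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical mirror of the fields documented for StuckTaskInfo.
pub struct Info {
    pub task_id: u64,
    pub description: String,
    pub phase: String,
    pub duration: Duration,
}

/// Same shape as the `on_stuck` parameter, but over the hypothetical `Info`.
pub type OnStuck = Box<dyn Fn(&[Info]) + Send + 'static>;

/// Builds a callback that sends one formatted line per stuck task to a channel.
pub fn channel_callback() -> (OnStuck, mpsc::Receiver<String>) {
    let (tx, rx) = mpsc::channel();
    let callback: OnStuck = Box::new(move |tasks: &[Info]| {
        for task in tasks {
            // Ignore send errors: the receiver may already have shut down.
            let _ = tx.send(format!(
                "{} [{}] stuck for {:.1}s",
                task.description,
                task.phase,
                task.duration.as_secs_f64()
            ));
        }
    });
    (callback, rx)
}

/// Simulates a watchdog thread invoking the callback once.
pub fn demo_alert_lines() -> Vec<String> {
    let (on_stuck, rx) = channel_callback();
    thread::spawn(move || {
        let stuck = vec![Info {
            task_id: 1,
            description: "POST /api/transfer".to_string(),
            phase: "committing".to_string(),
            duration: Duration::from_millis(2500),
        }];
        on_stuck(stuck.as_slice());
    })
    .join()
    .unwrap();
    // The sender was dropped with the thread, so this drains what was sent.
    rx.try_iter().collect()
}
```

A consumer thread could then read from the receiver and page, log, or export metrics at its own pace.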

Multi-threaded Usage

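TaskTracker uses DashMap internally, a concurrent map that shards its entries across several locks rather than one global lock, so threads tracking different tasks rarely contend and tracking is safe to call from any number of threads simultaneously: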

use nano_watchdog::{TaskTracker, WatchdogConfig, start_watchdog};
use std::time::Duration;

fn main() {
    let tracker: &'static TaskTracker = Box::leak(Box::new(TaskTracker::new()));
    start_watchdog(tracker, WatchdogConfig::default(), None);

    let mut handles = Vec::new();
    for thread_id in 0..8 {
        handles.push(std::thread::spawn(move || {
            for task_id in 0..50 {
                let guard = tracker.track(
                    &format!("worker-{} task-{}", thread_id, task_id),
                );
                guard.set_phase("processing");
                std::thread::sleep(Duration::from_millis(5));
                guard.set_phase("complete");
            }
        }));
    }

    for h in handles {
        h.join().unwrap();
    }

    println!("Active: {}, Completed: {}",
        tracker.active_count(), tracker.completed_count());
}

Configuration

WatchdogConfig controls the watchdog thread's behavior:

| Field | Type | Default | Description |
|---|---|---|---|
| stuck_threshold | Duration | 5 s | How long a task must run before it is considered stuck. |
| check_interval | Duration | 1 s | How often the watchdog thread wakes to scan for stuck tasks. |
| no_progress_alarm_count | u32 | 5 | Consecutive zero-progress checks before a deadlock warning fires. |
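To make the no_progress_alarm_count semantics concrete, here is a minimal sketch of a consecutive-stall counter. ProgressMonitor and observe are hypothetical names, and the rule that an idle tracker (no active tasks) resets the counter is an assumption of this sketch, not documented behavior of the crate. With the defaults above, a 1 s interval and a count of 5, such a counter would fire after about 5 s without any completion.

```rust
/// Hypothetical helper (NOT part of the crate): counts consecutive checks
/// during which the completed-task counter did not move.
pub struct ProgressMonitor {
    last_completed: u64,
    stalled_checks: u32,
    alarm_count: u32,
}

impl ProgressMonitor {
    pub fn new(alarm_count: u32) -> Self {
        ProgressMonitor {
            last_completed: 0,
            stalled_checks: 0,
            alarm_count,
        }
    }

    /// Call once per check interval with the current counters.
    /// Returns true when the deadlock alarm should fire.
    pub fn observe(&mut self, completed: u64, active: usize) -> bool {
        // Assumption: progress, or having nothing in flight, resets the count.
        if completed != self.last_completed || active == 0 {
            self.last_completed = completed;
            self.stalled_checks = 0;
            return false;
        }
        self.stalled_checks += 1;
        self.stalled_checks >= self.alarm_count
    }
}
```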

API Reference

TaskTracker

| Method | Returns | Description |
|---|---|---|
| TaskTracker::new() | TaskTracker | Create a new tracker. |
| track(&self, description: &str) | TaskGuard<'_> | Start tracking a task. Returns an RAII guard. |
| active_count(&self) | usize | Number of currently active (in-flight) tasks. |
| completed_count(&self) | u64 | Total tasks completed since creation. |
| check_stuck_tasks(&self, max_duration: Duration) | Vec<StuckTaskInfo> | Find tasks running longer than max_duration. |
| dump_all_tasks(&self) | String | Formatted dump of all active tasks for diagnostics. |

TaskGuard<'a>

| Method | Returns | Description |
|---|---|---|
| set_phase(&self, phase: &str) | () | Update the current phase annotation of this task. |
| id(&self) | u64 | Get the unique task ID. |
| (drop) | -- | Automatically unregisters the task on drop. |

StuckTaskInfo

| Field | Type | Description |
|---|---|---|
| task_id | u64 | Unique identifier of the stuck task. |
| description | String | The description passed to track(). |
| phase | String | Last phase set via set_phase(). |
| duration | Duration | How long the task has been running. |

start_watchdog

pub fn start_watchdog(
    tracker: &'static TaskTracker,
    config: WatchdogConfig,
    on_stuck: Option<Box<dyn Fn(&[StuckTaskInfo]) + Send + 'static>>,
)

Spawns the watchdog on a dedicated OS thread named "watchdog". Safe to call multiple times -- subsequent calls are no-ops.
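The "subsequent calls are no-ops" behavior can be obtained in several ways. One common pattern, shown here as an assumption and not as the crate's actual mechanism, is std::sync::Once. The names start_once and spawn_count are hypothetical, and the spawned thread body is left empty.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;
use std::thread;

static START: Once = Once::new();
static SPAWNED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical idempotent starter. Returns true only for the call that
/// actually spawned the thread; every later call does nothing.
pub fn start_once() -> bool {
    let mut started = false;
    START.call_once(|| {
        SPAWNED.fetch_add(1, Ordering::SeqCst);
        thread::Builder::new()
            .name("watchdog".to_string())
            .spawn(|| {
                // A real watchdog would loop here: sleep, scan, report.
            })
            .expect("failed to spawn watchdog thread");
        started = true;
    });
    started
}

/// How many threads this sketch has spawned so far.
pub fn spawn_count() -> usize {
    SPAWNED.load(Ordering::SeqCst)
}
```

Note that with this pattern the configuration and callback passed to any later call would be silently ignored. Whether the crate behaves the same way is not stated here.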

Examples

The crate ships with three runnable examples:

# Basic stuck-task detection
cargo run --example basic

# Custom callback for alerting
cargo run --example callback

# Concurrent tracking from 8 threads x 50 tasks
cargo run --example multi_thread

Design Principles

  • No Tokio dependency -- runs on a plain std::thread, so it survives async runtime freezes.
  • Near-zero cost when idle -- the watchdog thread sleeps between checks; no busy-polling.
  • RAII safety -- TaskGuard ensures tasks are always unregistered, even when a panic unwinds the stack (this does not apply when the binary is built with panic = "abort").
  • Minimal API surface -- five public types, one function. Learn it in minutes.
  • Observable -- integrates with tracing for structured logging; optional callback for custom alerting.
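The RAII safety point above relies on Rust running destructors while a panic unwinds. A small std-only sketch of that guarantee follows. Registration and ACTIVE are hypothetical stand-ins for a guard and a tracker, and the example assumes the default panic = "unwind" setting.

```rust
use std::panic;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for a tracker's active-task count.
static ACTIVE: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical guard: registers on creation, unregisters on drop.
struct Registration;

impl Registration {
    fn new() -> Self {
        ACTIVE.fetch_add(1, Ordering::SeqCst);
        Registration
    }
}

impl Drop for Registration {
    fn drop(&mut self) {
        ACTIVE.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Tracked work that fails midway while the guard is still alive.
fn tracked_work_that_panics() {
    let _reg = Registration::new();
    panic!("simulated failure inside tracked work");
}

/// Returns (did the work panic?, how many registrations are left afterwards).
pub fn panicked_and_cleaned_up() -> (bool, usize) {
    let result = panic::catch_unwind(tracked_work_that_panics);
    (result.is_err(), ACTIVE.load(Ordering::SeqCst))
}
```

The guard's Drop runs as the panic unwinds through tracked_work_that_panics, so the count is back to zero by the time catch_unwind returns.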

License

MIT