1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
//! Startup recovery sweep.
//!
//! A worker that exits ungracefully (SIGKILL, OOM, hardware failure)
//! leaves its in-flight rows in `status = 'running'` with `locked_at`
//! pointing at the crash time. The recovery sweep runs at the next
//! worker startup, looks for rows whose `locked_at` is older than a
//! threshold (default 5 minutes), and resets them to `retrying` so the
//! queue can re-claim them.
//!
//! The threshold is the key tunable: too short and we'll reprocess a
//! still-live job; too long and a crashed worker's row blocks for that
//! duration. See `tradeoffs.md` and `docs/runbook.md` for the operator-
//! facing details.
use Duration;
use PgPool;
use info;
use crateJobError;
use cratequeue;
/// Sweep rows whose `locked_at` is older than `threshold` ago. Returns
/// the number of rows reset.
///
/// Logs an `info!` line when the count is non-zero so that operators
/// see crash-recovery activity in normal logs without having to query
/// metrics. A zero-count run is silent to keep the log floor low on
/// clean startups.
pub async