# Worker Processes
Workers are the processes that generate load for scenarios. They run
inside the VM, each in its own cgroup.
## Fork, not threads
Workers are `fork()`ed processes. Cgroups operate on PIDs, so each
worker must be a separate process to be independently placed in a
cgroup.
## Two-phase start
Workers wait on a pipe for a "start" signal after fork:
1. Parent forks the worker.
2. Worker installs SIGUSR1 handler, then blocks on pipe read.
3. Parent moves the worker to its target cgroup.
4. Parent writes to the pipe, signaling the worker to start.
This ensures workers run inside their target cgroup from the first
instruction of their workload.
### Custom work types
`WorkType::Custom` workers follow the same two-phase start (fork,
cgroup placement, start signal), and the framework applies affinity
and scheduling policy before handing control to the user function.
After setup, the `run` function pointer takes over entirely --
the framework work loop is bypassed.
## Stop protocol
Workers install a SIGUSR1 handler that sets an atomic `STOP` flag. The
main work loop checks this flag each iteration. On stop:
1. Parent sends SIGUSR1 to all workers.
2. Workers exit their work loop.
3. Workers serialize their `WorkerReport` to a pipe.
4. Parent reads reports and waits for child exit.
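A minimal sketch of the flag-and-poll pattern, assuming the `signal_hook` crate for handler registration (the framework may install the handler differently):

```rust,ignore
// Stop-flag sketch: SIGUSR1 sets an AtomicBool that the work loop polls
// each iteration. signal_hook registration is an assumption.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

fn do_one_iteration() -> u64 { 1024 /* e.g. one 1024-unit spin burst */ }

fn work_loop() -> std::io::Result<u64> {
    let stop = Arc::new(AtomicBool::new(false));
    signal_hook::flag::register(signal_hook::consts::SIGUSR1, Arc::clone(&stop))?;

    let mut work_units = 0u64;
    while !stop.load(Ordering::Relaxed) {
        work_units += do_one_iteration(); // one outer iteration of the workload
    }
    // Loop exited: serialize the WorkerReport to the report pipe, then exit.
    Ok(work_units)
}
```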
## Telemetry
Each worker produces a `WorkerReport`:
```rust,ignore
pub struct WorkerReport {
    pub tid: i32,
    pub work_units: u64,
    pub cpu_time_ns: u64,
    pub wall_time_ns: u64,
    pub off_cpu_ns: u64,
    pub migration_count: u64,
    pub cpus_used: BTreeSet<usize>,
    pub migrations: Vec<Migration>,
    pub max_gap_ms: u64,
    pub max_gap_cpu: usize,
    pub max_gap_at_ms: u64,
    pub resume_latencies_ns: Vec<u64>,
    pub wake_sample_total: u64,
    pub iteration_costs_ns: Vec<u64>,
    pub iteration_cost_sample_total: u64,
    pub iterations: u64,
    pub schedstat_run_delay_ns: u64,
    pub schedstat_run_count: u64,
    pub schedstat_cpu_time_ns: u64,
    pub completed: bool,
    pub numa_pages: BTreeMap<usize, u64>,
    pub vmstat_numa_pages_migrated: u64,
    pub exit_info: Option<WorkerExitInfo>,
    pub is_messenger: bool,
    pub group_idx: usize,
    pub affinity_error: Option<String>,
}

pub enum WorkerExitInfo {
    Exited(i32),
    Signaled(i32),
    TimedOut,
    WaitFailed(String),
    /// Thread-mode worker panicked. Exclusive to `CloneMode::Thread`;
    /// fork workers surface panics via `Exited(1)` or
    /// `Signaled(SIGABRT)` depending on the panic strategy.
    Panicked(String),
}
```
`iteration_costs_ns` mirrors `resume_latencies_ns` for per-iteration
wall-clock cost: a reservoir-sampled vector capped at
`MAX_WAKE_SAMPLES` entries, paired with `iteration_cost_sample_total`
for the total observation count when the cap is exceeded.
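A minimal sketch of that cap-plus-counter pattern; the specific sampler shown (Algorithm R with an xorshift64 slot picker) is an assumption, since only the cap and the running totals are documented here:

```rust,ignore
// Cap-plus-counter sampling sketch (Algorithm R). The framework's actual
// sampler may differ; only the cap and the running total are documented.
const MAX_WAKE_SAMPLES: usize = 100_000;

struct Reservoir {
    samples: Vec<u64>, // becomes resume_latencies_ns / iteration_costs_ns
    total: u64,        // becomes wake_sample_total / iteration_cost_sample_total
    rng: u64,          // xorshift64 state; must be seeded non-zero
}

impl Reservoir {
    fn record(&mut self, value_ns: u64) {
        self.total += 1;
        if self.samples.len() < MAX_WAKE_SAMPLES {
            self.samples.push(value_ns);
            return;
        }
        // Past the cap: overwrite a uniformly chosen slot with probability
        // cap / total, so the vector stays a uniform sample of every
        // observation while `total` keeps counting all of them.
        self.rng ^= self.rng << 13;
        self.rng ^= self.rng >> 7;
        self.rng ^= self.rng << 17;
        let slot = (self.rng % self.total) as usize;
        if slot < MAX_WAKE_SAMPLES {
            self.samples[slot] = value_ns;
        }
    }
}
```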
`group_idx` is `0` for the primary group and `1..=N` for composed
`WorkSpec` entries in declaration order (mirrors
`WorkloadConfig::composed`). `affinity_error` is `Some(reason)`
when the worker's `sched_setaffinity` / `mbind` setup failed; the
worker still runs and produces a report but the field documents
the divergence from the requested affinity contract.
Several fields are worth calling out explicitly:
- `wake_sample_total` — the TOTAL number of wake-latency
observations the worker saw, including samples the reservoir
sampler dropped. `resume_latencies_ns` is clamped to at most
100_000 entries (`MAX_WAKE_SAMPLES`); on a long run that
accumulates more wakes than the cap, the vector stays at the
cap while this counter keeps climbing. Host-side consumers
reporting "total wakeups observed" read `wake_sample_total`;
percentile / CV computations read `resume_latencies_ns`.
- `completed` — `true` when the worker reached its natural end
(outer loop observed STOP and exited cleanly, or a
custom-closure payload returned from its `run`). Sentinel reports
synthesised by `stop_and_collect`'s JSON-parse fallback carry
`false`. Lets consumers distinguish "ran to completion, saw
zero iterations" from "died / timed out before recording
anything."
- `is_messenger` — `true` only for the messenger worker in a
`FutexFanOut` / `FanOutCompute` group (the single writer that
advances the shared generation and issues `futex_wake`).
Enables per-worker latency-participation assertions —
receivers produce `resume_latencies_ns` entries, messengers
record wake-side work but no resume latency.
- `off_cpu_ns = wall_time_ns - cpu_time_ns`
- `exit_info` is `None` on every live-worker-authored report.
`stop_and_collect` synthesises a sentinel `WorkerReport` with
`Some(_)` when the worker handed back no (or unparseable) JSON,
using the `WorkerExitInfo` enum
(`Exited(code)` / `Signaled(signum)` / `TimedOut` /
`WaitFailed(String)` — the string carries the underlying `waitpid`
errno rendering) to preserve the reap shape for post-mortem.
- Migrations are tracked every 1024 work units: after each outer
iteration the worker checks `work_units.is_multiple_of(1024)`
and runs the migration-detect body only when that is true. The
check runs exactly once per outer iteration, so the effective
period in outer iterations is
`1024 / gcd(units_per_iter, 1024)` (see the worked sketch after
this list). Default parameters assumed unless noted:
- **Every outer iteration (period = 1 iter)**: SpinWait (1024),
Mixed (1024), Bursty (each outer iter runs `spin_burst(1024)`
some number of times inside the `burst_ms` loop — always a
multiple of 1024), PipeIo (`burst_iters`=1024), FutexPingPong
(`spin_iters`=1024), CachePressure (1024 strided RMW steps),
CacheYield (1024 strided RMW steps), CachePipe
(`burst_iters`=1024), FutexFanOut messenger AND receiver
(both call `spin_burst(spin_iters)` before splitting roles;
default 1024), AffinityChurn (`spin_iters`=1024), PolicyChurn
(`spin_iters`=1024).
- **Every 2 iterations**: NiceSweep (`spin_burst(512)` per iter
→ `gcd(512, 1024) = 512`).
- **Every 4 iterations**: MutexContention
(`work_iters`=1024 + `hold_iters`=256 = 1280 per acquire+
release → `gcd(1280, 1024) = 256`, period = 4 iters).
FanOutCompute messenger (`spin_burst(256)` per wake cycle
→ same 256-unit gcd).
- **Every 16 iterations**: PageFaultChurn — one persistent
`MAP_PRIVATE | MAP_ANONYMOUS` region per worker (default
4 MiB via `region_kb`=4096), re-faulted each outer
iteration via `madvise(MADV_DONTNEED)`. Each iteration
contributes `touches_per_cycle`=256 page writes (the first
write to a page after `MADV_DONTNEED` triggers a minor fault,
but the xorshift64-chosen index can collide with a page
already faulted this cycle, so 256 is an upper bound on the
faults per cycle, not a guarantee) + `spin_iters`=64 = 320 work units
(`gcd(320, 1024) = 64`).
- **Every 64 iterations**: IoSyncWrite (16 4-KiB writes per
write-then-sleep pair → `gcd(16, 1024) = 16`); IoRandRead and
IoConvoy use the same 64-iteration cadence for their per-iteration
pread/pwrite mixes.
- **Every 1024 iterations**: YieldHeavy (1 unit per yield),
ForkExit (1 unit per fork+wait), FanOutCompute worker
(`operations`=5 matrix multiplies per wake, one `work_units`
tick per multiply → `gcd(5, 1024) = 1`).
- **Phase-inherited**: Sequence inherits whichever phase is
currently active — Spin / Yield / Io use the same per-unit
accounting as the SpinWait / YieldHeavy / IoSyncWrite groups
above; Sleep contributes no `work_units` and so pauses migration
checks while it runs.
- **Not tracked by the framework**: Custom workers do not
contribute to `work_units` on the framework's behalf —
migration tracking fires only if the user's `run` function
increments `work_units` and emits migrations directly.
- Scheduling gaps (`max_gap_ms`, `max_gap_cpu`, `max_gap_at_ms`)
record the longest wall-clock interval between consecutive
1024-work-unit checkpoints, along with the CPU the gap was
observed on and its offset from the start of the run. High values
indicate preemption or descheduling near a checkpoint boundary. The
checkpoint cadence — and therefore the gap-measurement
cadence — is governed by the same `work_units.is_multiple_of(1024)`
test that the migration tracker uses, so the effective
measurement period in outer iterations matches the per-WorkType
tables above.
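All of the periods above fall out of the same arithmetic. A small worked sketch, with illustrative constant and helper names:

```rust,ignore
// Checkpoint-cadence arithmetic. The constant and helper names are
// illustrative; the per-WorkType unit counts come from the table above.
const CHECK_PERIOD_UNITS: u64 = 1024;

fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// The check fires when `work_units` is a multiple of 1024, and each outer
/// iteration adds `units_per_iter`, so the effective period in iterations
/// is 1024 / gcd(units_per_iter, 1024).
fn period_iters(units_per_iter: u64) -> u64 {
    CHECK_PERIOD_UNITS / gcd(units_per_iter, CHECK_PERIOD_UNITS)
}

fn main() {
    assert_eq!(period_iters(1024), 1);  // SpinWait, Mixed, PipeIo, ...
    assert_eq!(period_iters(512), 2);   // NiceSweep
    assert_eq!(period_iters(1280), 4);  // MutexContention
    assert_eq!(period_iters(320), 16);  // PageFaultChurn
    assert_eq!(period_iters(16), 64);   // IoSyncWrite
    assert_eq!(period_iters(5), 1024);  // FanOutCompute worker
}
```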
### Benchmarking fields
Workers collect two categories of timing data:
**Per-wakeup latency** (`resume_latencies_ns`): timestamp-based samples
recorded around blocking operations. Populated for work types with a
blocking step: Bursty (sleep), PipeIo (pipe read), FutexPingPong
(futex wait), FutexFanOut (futex wait, receivers only), FanOutCompute
(futex wait, workers only — measured as `CLOCK_MONOTONIC` delta from
messenger's shared timestamp), CacheYield (yield), CachePipe (pipe
read), IoSyncWrite / IoRandRead / IoConvoy (pread / pwrite / fdatasync
blocking), NiceSweep (yield), AffinityChurn (yield),
PolicyChurn (yield), MutexContention (futex wait on contended
acquire), ForkExit (parent's waitpid wait), and Sequence when its
phases include Sleep, Yield, or Io.
Each sample is in nanoseconds; most work types use
`Instant::elapsed()` across the blocking call, while FanOutCompute
uses `clock_gettime(CLOCK_MONOTONIC)` to measure against the
messenger's pre-wake timestamp.
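A minimal sketch of the common `Instant`-based pattern, using a pipe read as the blocking step; the helper name and the plain `Vec` in place of the reservoir sampler are illustrative:

```rust,ignore
// Wake-latency sampling sketch around a blocking pipe read.
use std::io::Read;
use std::time::Instant;

fn blocking_read_sample(pipe: &mut impl Read, samples_ns: &mut Vec<u64>) -> std::io::Result<()> {
    let mut buf = [0u8; 1];
    let blocked_at = Instant::now();
    pipe.read_exact(&mut buf)?; // blocks until the peer writes a byte
    // Elapsed time covers the wait plus the wakeup-to-running latency.
    samples_ns.push(blocked_at.elapsed().as_nanos() as u64);
    Ok(())
}
```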
**schedstat deltas**: read from `/proc/self/schedstat` at work-loop
start and end. Three fields:
- `schedstat_cpu_time_ns` -- delta of field 1 (on-CPU time)
- `schedstat_run_delay_ns` -- delta of field 2 (time spent waiting
for a CPU)
- `schedstat_run_count` -- delta of field 3 (**pcount** —
scheduler-in count: incremented each time the scheduler picks
this task to execute, across CFS/EEVDF, FIFO/RR, and sched_ext
alike). Not a context-switch count — a task that keeps running
on the same CPU without leaving the runqueue does not see
pcount advance while it runs. For true context-switch counts
read `/proc/<pid>/status`'s `voluntary_ctxt_switches` and
`nonvoluntary_ctxt_switches`; the worker reads pcount instead
because schedstat delivers it alongside `run_delay` /
`cpu_time` in a single file read.
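A minimal sketch of the snapshot read; the struct and field names are illustrative, only the three-field file format and the start/end delta are documented here:

```rust,ignore
// Schedstat snapshot sketch.
use std::fs;

#[derive(Clone, Copy, Default)]
struct SchedstatSnapshot {
    cpu_time_ns: u64,  // field 1: on-CPU time
    run_delay_ns: u64, // field 2: runnable time spent waiting for a CPU
    pcount: u64,       // field 3: times the scheduler picked this task
}

fn read_schedstat() -> std::io::Result<SchedstatSnapshot> {
    let raw = fs::read_to_string("/proc/self/schedstat")?;
    let mut fields = raw.split_whitespace().map(|f| f.parse::<u64>().unwrap_or(0));
    Ok(SchedstatSnapshot {
        cpu_time_ns: fields.next().unwrap_or(0),
        run_delay_ns: fields.next().unwrap_or(0),
        pcount: fields.next().unwrap_or(0),
    })
}

// Usage: snapshot at work-loop start and end, then report the deltas, e.g.
// schedstat_run_delay_ns = end.run_delay_ns - start.run_delay_ns.
```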
`iterations` counts outer-loop iterations.
### NUMA fields
**`numa_pages`**: per-NUMA-node page counts parsed from
`/proc/self/numa_maps` after the workload completes. Keyed by node ID.
Empty when numa_maps is unavailable.
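A minimal sketch of how such counts can be pulled out of `numa_maps` by summing the `N<node>=<pages>` tokens; the actual parser may differ:

```rust,ignore
// numa_maps aggregation sketch: sum per-node page counts across mappings.
use std::collections::BTreeMap;
use std::fs;

fn read_numa_pages() -> BTreeMap<usize, u64> {
    let mut pages: BTreeMap<usize, u64> = BTreeMap::new();
    let Ok(raw) = fs::read_to_string("/proc/self/numa_maps") else {
        return pages; // empty when numa_maps is unavailable
    };
    for token in raw.split_whitespace() {
        if let Some(rest) = token.strip_prefix('N') {
            if let Some((node, count)) = rest.split_once('=') {
                if let (Ok(node), Ok(count)) = (node.parse::<usize>(), count.parse::<u64>()) {
                    *pages.entry(node).or_insert(0) += count;
                }
            }
        }
    }
    pages
}
```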
**`vmstat_numa_pages_migrated`**: delta of the `numa_pages_migrated`
counter from `/proc/vmstat` between pre- and post-workload snapshots.
Measures cross-node page migrations during the test.
These fields feed the NUMA [checking
thresholds](../concepts/checking.md#numa-checks).
Custom workers produce their own `WorkerReport`. The framework does
not populate any telemetry fields for Custom -- migration tracking,
gap detection, schedstat deltas, NUMA page counts, and iteration
counters are only present if the user's `run` function fills them.
## Worker-progress watchdog
Workers send SIGUSR2 to the scheduler process when they make no progress
for more than 2 seconds. SIGUSR2's default POSIX disposition terminates
the scheduler, which ktstr detects as a scheduler death and captures the
sched_ext dump from dmesg.
In repro mode, the watchdog is disabled to keep the scheduler alive
for BPF probe assertions. The watchdog does not fire for Custom
workers because they bypass the framework work loop.
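One plausible shape for that mechanism, sketched purely as an illustration: a heartbeat counter bumped by the work loop and a monitor thread that signals the scheduler when the counter stalls. Only the 2-second threshold and the SIGUSR2 signal are documented above; the heartbeat detection is an assumption:

```rust,ignore
// Stuck-detection sketch. The heartbeat-counter mechanism is an
// illustrative assumption, not the framework's documented implementation.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

fn spawn_watchdog(heartbeat: Arc<AtomicU64>, scheduler_pid: libc::pid_t) {
    std::thread::spawn(move || {
        let mut last = heartbeat.load(Ordering::Relaxed);
        loop {
            std::thread::sleep(Duration::from_secs(2));
            let now = heartbeat.load(Ordering::Relaxed);
            if now == last {
                // No progress in 2 s: with SIGUSR2's default disposition this
                // terminates the scheduler, and ktstr then captures the
                // sched_ext dump from dmesg.
                unsafe { libc::kill(scheduler_pid, libc::SIGUSR2) };
            }
            last = now;
        }
    });
}
// The work loop would bump `heartbeat` once per outer iteration, so the
// watchdog only fires when iterations stall.
```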
## RAII cleanup
`WorkloadHandle` implements `Drop`: it sends SIGKILL to all child
processes and waits for them. This prevents orphaned worker processes
on error paths.
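A minimal sketch of that `Drop` contract; the field names are illustrative, only the SIGKILL-then-wait behavior is documented above:

```rust,ignore
// RAII cleanup sketch: kill every child, then reap it, even on error paths.
struct WorkloadHandle {
    children: Vec<libc::pid_t>,
}

impl Drop for WorkloadHandle {
    fn drop(&mut self) {
        for &pid in &self.children {
            // Best-effort: the child may already have exited.
            unsafe { libc::kill(pid, libc::SIGKILL) };
        }
        for &pid in &self.children {
            let mut status = 0;
            // Reap so no zombies or orphaned workers outlive the handle.
            unsafe { libc::waitpid(pid, &mut status, 0) };
        }
    }
}
```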