ktstr 0.5.2

Test harness for Linux process schedulers

# Worker Processes

Workers are the processes that generate load for scenarios. They run
inside the VM, each in its own cgroup.

## Fork, not threads

Workers are `fork()`ed processes. Cgroups operate on PIDs, so each
worker must be a separate process to be independently placed in a
cgroup.

## Two-phase start

Workers wait on a pipe for a "start" signal after fork:

1. Parent forks the worker.
2. Worker installs SIGUSR1 handler, then blocks on pipe read.
3. Parent moves the worker to its target cgroup.
4. Parent writes to the pipe, signaling the worker to start.

This ensures workers run inside their target cgroup from the first
instruction of their workload.

### Custom work types

`WorkType::Custom` workers follow the same two-phase start (fork,
cgroup placement, start signal), and the framework applies affinity
and scheduling policy before handing control to the user function.
After setup, the `run` function pointer takes over entirely --
the framework work loop is bypassed.

## Stop protocol

Workers install a SIGUSR1 handler that sets an atomic `STOP` flag. The
main work loop checks this flag each iteration. On stop:

1. Parent sends SIGUSR1 to all workers.
2. Workers exit their work loop.
3. Workers serialize their `WorkerReport` to a pipe.
4. Parent reads reports and waits for child exit.

## Telemetry

Each worker produces a `WorkerReport`:

```rust,ignore
pub struct WorkerReport {
    pub tid: i32,
    pub work_units: u64,
    pub cpu_time_ns: u64,
    pub wall_time_ns: u64,
    pub off_cpu_ns: u64,
    pub migration_count: u64,
    pub cpus_used: BTreeSet<usize>,
    pub migrations: Vec<Migration>,
    pub max_gap_ms: u64,
    pub max_gap_cpu: usize,
    pub max_gap_at_ms: u64,
    pub resume_latencies_ns: Vec<u64>,
    pub wake_sample_total: u64,
    pub iteration_costs_ns: Vec<u64>,
    pub iteration_cost_sample_total: u64,
    pub iterations: u64,
    pub schedstat_run_delay_ns: u64,
    pub schedstat_run_count: u64,
    pub schedstat_cpu_time_ns: u64,
    pub completed: bool,
    pub numa_pages: BTreeMap<usize, u64>,
    pub vmstat_numa_pages_migrated: u64,
    pub exit_info: Option<WorkerExitInfo>,
    pub is_messenger: bool,
    pub group_idx: usize,
    pub affinity_error: Option<String>,
}

pub enum WorkerExitInfo {
    Exited(i32),
    Signaled(i32),
    TimedOut,
    WaitFailed(String),
    /// Thread-mode worker panicked. Exclusive to `CloneMode::Thread`;
    /// fork workers surface panics via `Exited(1)` or
    /// `Signaled(SIGABRT)` depending on the panic strategy.
    Panicked(String),
}
```

`iteration_costs_ns` mirrors `resume_latencies_ns` for per-iteration
wall-clock cost: a reservoir-sampled vector capped at
`MAX_WAKE_SAMPLES` entries, paired with `iteration_cost_sample_total`
for the total observation count when the cap is exceeded.
`group_idx` is `0` for the primary group and `1..=N` for composed
`WorkSpec` entries in declaration order (mirrors
`WorkloadConfig::composed`). `affinity_error` is `Some(reason)`
when the worker's `sched_setaffinity` / `mbind` setup failed; the
worker still runs and produces a report but the field documents
the divergence from the requested affinity contract.

Three fields worth calling out explicitly:

- `wake_sample_total` — the TOTAL number of wake-latency
  observations the worker saw, including samples the reservoir
  sampler dropped. `resume_latencies_ns` is clamped to at most
  100_000 entries (`MAX_WAKE_SAMPLES`); on a long run that
  accumulates more wakes than the cap, the vector stays at the
  cap while this counter keeps climbing. Host-side consumers
  reporting "total wakeups observed" read `wake_sample_total`;
  percentile / CV computations read `resume_latencies_ns`.
- `completed` — `true` when the worker reached its natural end
  (the outer loop observed STOP and exited cleanly, or a
  custom-closure payload returned from its `run`). Sentinel
  reports synthesised by `stop_and_collect`'s JSON-parse fallback
  carry `false`. This lets consumers distinguish "ran to
  completion, saw zero iterations" from "died / timed out before
  recording anything."
- `is_messenger` — `true` only for the messenger worker in a
  `FutexFanOut` / `FanOutCompute` group (the single writer that
  advances the shared generation and issues `futex_wake`).
  Enables per-worker latency-participation assertions —
  receivers produce `resume_latencies_ns` entries, messengers
  record wake-side work but no resume latency.

- `off_cpu_ns = wall_time_ns - cpu_time_ns`
- `exit_info` is `None` on every live-worker-authored report.
  `stop_and_collect` synthesises a sentinel `WorkerReport` with
  `Some(_)` when the worker handed back no (or unparseable) JSON,
  using the `WorkerExitInfo` enum
  (`Exited(code)` / `Signaled(signum)` / `TimedOut` /
  `WaitFailed(String)` — the string carries the underlying `waitpid`
  errno rendering) to preserve the reap shape for post-mortem.
- Migrations are tracked every 1024 work units: after each outer
  iteration the worker checks `work_units.is_multiple_of(1024)`
  and runs the migration-detect body iff that is true. The check
  runs exactly once per outer iteration, so the effective period
  in outer iterations is
  `1024 / gcd(units_per_iter, 1024)`. Default parameters assumed
  unless noted:
  - **Every outer iteration (period = 1 iter)**: SpinWait (1024),
    Mixed (1024), Bursty (each outer iter runs `spin_burst(1024)`
    some number of times inside the `burst_ms` loop — always a
    multiple of 1024), PipeIo (`burst_iters`=1024), FutexPingPong
    (`spin_iters`=1024), CachePressure (1024 strided RMW steps),
    CacheYield (1024 strided RMW steps), CachePipe
    (`burst_iters`=1024), FutexFanOut messenger AND receiver
    (both call `spin_burst(spin_iters)` before splitting roles;
    default 1024), AffinityChurn (`spin_iters`=1024), PolicyChurn
    (`spin_iters`=1024).
  - **Every 2 iterations**: NiceSweep (`spin_burst(512)` per iter
    → `gcd(512, 1024) = 512`, period = 2 iters).
  - **Every 4 iterations**: MutexContention
    (`work_iters`=1024 + `hold_iters`=256 = 1280 per acquire+
    release → `gcd(1280, 1024) = 256`, period = 4 iters).
    FanOutCompute messenger (`spin_burst(256)` per wake cycle
    → same 256-unit gcd).
  - **Every 16 iterations**: PageFaultChurn — one persistent
    `MAP_PRIVATE | MAP_ANONYMOUS` region per worker (default
    4 MiB via `region_kb`=4096), re-faulted each outer
    iteration via `madvise(MADV_DONTNEED)`. Each iteration
    contributes `touches_per_cycle`=256 page writes (each first
    write after `MADV_DONTNEED` triggers a minor fault; a
    birthday-collision xorshift64 index may revisit a page
    already faulted this cycle, so the fault count is a ceiling,
    not a floor) + `spin_iters`=64 = 320 work units
    (`gcd(320, 1024) = 64`).
  - **Every 64 iterations**: IoSyncWrite (16 4-KiB writes per
    write-then-sleep pair → `gcd(16, 1024) = 16`); IoRandRead and
    IoConvoy use the same 64-iteration cadence for their per-iteration
    pread/pwrite mixes.
  - **Every 1024 iterations**: YieldHeavy (1 unit per yield),
    ForkExit (1 unit per fork+wait), FanOutCompute worker
    (`operations`=5 matrix multiplies per wake, one `work_units`
    tick per multiply → `gcd(5, 1024) = 1`).
  - **Phase-inherited**: Sequence inherits whichever phase is
    currently active — Spin / Yield / Io use the same per-unit
    accounting as the SpinWait / YieldHeavy / IoSyncWrite groups
    above; Sleep contributes no `work_units` and so pauses migration
    checks while it runs.
  - **Not tracked by the framework**: Custom workers do not
    contribute to `work_units` on the framework's behalf —
    migration tracking fires only if the user's `run` function
    increments `work_units` and emits migrations directly.
- Scheduling gaps (`max_gap_ms`, `max_gap_cpu`, `max_gap_at_ms`)
  record the longest wall-clock interval between consecutive
  1024-work-unit migration-check points plus the CPU the gap
  was observed on and its time from start. High values indicate
  preemption or descheduling near a checkpoint boundary. The
  checkpoint cadence — and therefore the gap-measurement
  cadence — is governed by the same `work_units.is_multiple_of(1024)`
  test that the migration tracker uses, so the effective
  measurement period in outer iterations matches the per-WorkType
  tables above.

### Benchmarking fields

Workers collect two categories of timing data:

**Per-wakeup latency** (`resume_latencies_ns`): timestamp-based samples
recorded around blocking operations. Populated for work types with a
blocking step: Bursty (sleep), PipeIo (pipe read), FutexPingPong
(futex wait), FutexFanOut (futex wait, receivers only), FanOutCompute
(futex wait, workers only — measured as `CLOCK_MONOTONIC` delta from
messenger's shared timestamp), CacheYield (yield), CachePipe (pipe
read), IoSyncWrite / IoRandRead / IoConvoy (pread / pwrite / fdatasync
blocking), NiceSweep (yield), AffinityChurn (yield),
PolicyChurn (yield), MutexContention (futex wait on contended
acquire), ForkExit (parent's waitpid wait), and Sequence when its
phases include Sleep, Yield, or Io.
Each sample is in nanoseconds; most work types use
`Instant::elapsed()` across the blocking call, while FanOutCompute
uses `clock_gettime(CLOCK_MONOTONIC)` to measure against the
messenger's pre-wake timestamp.

**schedstat deltas**: read from `/proc/self/schedstat` at work-loop
start and end. Three fields:
- `schedstat_cpu_time_ns` -- delta of field 1 (on-CPU time)
- `schedstat_run_delay_ns` -- delta of field 2 (time spent waiting
  for a CPU)
- `schedstat_run_count` -- delta of field 3 (**pcount**, the
  scheduler-in count: incremented each time the scheduler picks
  this task to execute, across CFS/EEVDF, FIFO/RR, and sched_ext
  alike). Not a context-switch count — a task that keeps running
  on the same CPU without leaving the runqueue does not see
  pcount advance while it runs. For true context-switch counts
  read `/proc/<pid>/status`'s `voluntary_ctxt_switches` and
  `nonvoluntary_ctxt_switches`; the worker reads pcount instead
  because schedstat delivers it alongside `run_delay` /
  `cpu_time` in a single file read.

`iterations` counts outer-loop iterations.

### NUMA fields

**`numa_pages`**: per-NUMA-node page counts parsed from
`/proc/self/numa_maps` after the workload completes. Keyed by node ID.
Empty when `numa_maps` is unavailable.

**`vmstat_numa_pages_migrated`**: delta of the `numa_pages_migrated`
counter from `/proc/vmstat` between pre- and post-workload snapshots.
Measures cross-node page migrations during the test.

These fields feed the NUMA [checking
thresholds](../concepts/checking.md#numa-checks).

Custom workers produce their own `WorkerReport`. The framework does
not populate any telemetry fields for Custom -- migration tracking,
gap detection, schedstat deltas, NUMA page counts, and iteration
counters are only present if the user's `run` function fills them.

## Worker-progress watchdog

Workers send SIGUSR2 to the scheduler when stuck for more than
2 seconds. The
default POSIX disposition terminates the scheduler process, which ktstr
detects as a scheduler death and captures the sched_ext dump from
dmesg.

In repro mode, the watchdog is disabled to keep the scheduler alive
for BPF probe assertions. The watchdog does not fire for Custom
workers because they bypass the framework work loop.

## RAII cleanup

`WorkloadHandle` implements `Drop`: it sends SIGKILL to all child
processes and waits for them. This prevents orphaned worker processes
on error paths.