ktstr 0.5.2

Test harness for Linux process schedulers
# Ops and Steps

The ops system is a composable way to express dynamic cgroup topology
changes. It replaces hand-written `Action::Custom` functions for most
dynamic scenarios.

## Op

An `Op` is an atomic operation on the cgroup topology. The enum is
`#[non_exhaustive]`, so external `match` expressions must include a
wildcard arm (`_ => ...`) to stay compatible across ktstr version bumps
that add new variants:

| Op | Description |
|---|---|
| `AddCgroup` | Create a cgroup |
| `RemoveCgroup` | Stop workers and remove a cgroup |
| `SetCpuset` | Set a cgroup's cpuset via `CpusetSpec` |
| `ClearCpuset` | Remove cpuset constraints |
| `SwapCpusets` | Swap cpusets between two cgroups |
| `Spawn` | Fork workers into a cgroup |
| `StopCgroup` | Stop a cgroup's workers |
| `SetAffinity` | Set worker affinity via `AffinityIntent` |
| `SpawnHost` | Spawn workers in the parent cgroup |
| `MoveAllTasks` | Move all tasks from one cgroup to another |
| `RunPayload` | Spawn a binary-kind [`Payload`](../writing-tests/scheduler-definitions.md#derive-payload) in the background and track its `PayloadHandle` under the step's payload set. Subsequent `WaitPayload` / `KillPayload` address it by `(payload.name, cgroup)`. Scheduler-kind payloads are rejected at apply time. |
| `WaitPayload` | Block until the named payload exits naturally, evaluate its checks, and record metrics to the per-test sidecar. Target lookup is by `(name, cgroup)` composite key; `cgroup: None` resolves to the unique live copy. No timeout — pair with a bounded `HoldSpec` or the payload's own `--runtime` for time-boxed runs. |
| `KillPayload` | SIGKILL the named payload, reap the child, evaluate checks, and record metrics. Same `(name, cgroup)` lookup rules as `WaitPayload`. Mirrors step-teardown drain for an explicitly-targeted payload. |
| `FreezeCgroup` | Freeze every task in the named cgroup via `cgroup.freeze` (kernel-side asynchronous freeze; not a SIGSTOP). Idempotent for already-frozen cgroups. Pair with `UnfreezeCgroup` to release; teardown auto-unfreezes. See [Snapshots](../writing-tests/snapshots.md) for the observer-cgroup deadlock warning. |
| `UnfreezeCgroup` | Unfreeze every task in the named cgroup via `cgroup.freeze`. Inverse of `FreezeCgroup`. Idempotent. |
| `Snapshot` | Capture a host-side diagnostic snapshot under `name` via the freeze coordinator: pauses every vCPU, reads BPF map state, vCPU registers, and per-CPU counters into a `FailureDumpReport`, then resumes. The report is keyed by `name` on the active `SnapshotBridge`. If no bridge is active, the op is a no-op and emits a `tracing::warn!`. See [Snapshots](../writing-tests/snapshots.md). |
| `WatchSnapshot` | Capture a snapshot whenever the guest writes to the named kernel symbol; one fire = one capture tagged with the symbol path. Symbol resolution at op execution time looks the name up by **verbatim vmlinux ELF symbol-table match** — the requested name must appear in the guest kernel's static symbol table exactly as written (no path expansion, no BTF descent). Maximum 3 watch ops per scenario (3 hardware watchpoint slots; 1 slot reserved for the error-class exit_kind trigger). See [Watch Snapshots](../writing-tests/watch-snapshots.md). |
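
The `(name, cgroup)` lookup used by `WaitPayload` and `KillPayload` can be pictured with a small model. This is only a sketch: `LivePayload` and `resolve` are hypothetical stand-ins, not ktstr types, and treating an ambiguous `cgroup: None` lookup as an error is an assumption (the table only says it resolves to the unique live copy).

```rust
/// Local stand-in for a tracked payload; not the ktstr `PayloadHandle`.
#[derive(Debug, Clone, PartialEq)]
struct LivePayload {
    name: String,
    cgroup: String,
}

/// Resolve a payload by `(name, cgroup)`.
/// `cgroup: None` succeeds only when exactly one live payload has that name.
fn resolve<'a>(
    live: &'a [LivePayload],
    name: &str,
    cgroup: Option<&str>,
) -> Result<&'a LivePayload, String> {
    let mut matches = live
        .iter()
        .filter(|p| p.name == name && cgroup.map_or(true, |cg| p.cgroup == cg));
    match (matches.next(), matches.next()) {
        (Some(p), None) => Ok(p),
        (None, _) => Err(format!("no live payload named {name:?}")),
        // Assumption: an ambiguous lookup is an error rather than "first wins".
        (Some(_), Some(_)) => Err(format!("payload {name:?} is ambiguous; pass a cgroup")),
    }
}
```

With two live copies of the same payload in different cgroups, only the lookup that names a cgroup succeeds in this model.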

Op constructors accept string literals directly (no `.into()` needed):

```rust,ignore
Op::add_cgroup("cg_0")
Op::set_cpuset("cg_0", CpusetSpec::disjoint(0, 2))
Op::stop_cgroup("cg_0")
Op::spawn("cg_0", WorkSpec::default().workers(4))
Op::set_affinity("cg_0", AffinityIntent::RandomSubset)
Op::spawn_host(WorkSpec::default().workers(4))
Op::freeze_cgroup("cg_0")
Op::unfreeze_cgroup("cg_0")
Op::snapshot("after_spawn")
Op::watch_snapshot("jiffies_64")
```

`SpawnHost` creates workers in the parent cgroup, not in a managed
cgroup. Use this to simulate host-level CPU contention alongside
managed cgroups.

## OpKind

`OpKind` is a payload-free discriminant enum generated from `Op` via
`#[strum_discriminants]`. It carries the same variant set as `Op`
(`AddCgroup`, `RemoveCgroup`, ..., `RunPayload`, `WaitPayload`,
`KillPayload`, `FreezeCgroup`, `UnfreezeCgroup`, `Snapshot`,
`WatchSnapshot`) with none of the inner fields, so it is cheap to
copy and use as a map key. Framework code uses `OpKind` when it
only cares which operation ran (per-op statistics, stimulus-event
tagging, verifier/monitor bookkeeping), not about its payload. Test
authors rarely spell `OpKind` directly. The `strum::EnumIter`
derive also lets tooling enumerate every `OpKind` variant for
coverage checks.

`OpKind` shares `Op`'s `#[non_exhaustive]` attribute: external
`match` expressions over `OpKind` must include a wildcard arm
(`_ => ...`).
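
As a rough illustration of why a payload-free discriminant is convenient as a map key, the sketch below counts ops per kind. The `Op` and `OpKind` here are cut-down local stand-ins with assumed field shapes, and `kind()`, `count_by_kind`, and `is_topology_change` are hypothetical helpers; the real `OpKind` is derived by strum rather than written by hand.

```rust
use std::collections::HashMap;

/// Local stand-in with three of the variants; not the ktstr `Op`.
#[derive(Debug, Clone)]
enum Op {
    AddCgroup { name: String },
    StopCgroup { name: String },
    Snapshot { name: String },
}

/// Payload-free discriminant, hand-written here instead of derived via strum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum OpKind {
    AddCgroup,
    StopCgroup,
    Snapshot,
}

impl Op {
    /// Drop the payload and keep only which operation this is.
    fn kind(&self) -> OpKind {
        match self {
            Op::AddCgroup { .. } => OpKind::AddCgroup,
            Op::StopCgroup { .. } => OpKind::StopCgroup,
            Op::Snapshot { .. } => OpKind::Snapshot,
        }
    }
}

/// Count how many ops of each kind appear in `ops`.
fn count_by_kind(ops: &[Op]) -> HashMap<OpKind, usize> {
    let mut counts = HashMap::new();
    for op in ops {
        *counts.entry(op.kind()).or_insert(0) += 1;
    }
    counts
}

/// Downstream-style match: the catch-all arm absorbs variants added later.
fn is_topology_change(kind: OpKind) -> bool {
    match kind {
        OpKind::AddCgroup | OpKind::StopCgroup => true,
        _ => false,
    }
}
```

Because `OpKind` is `Copy` and `Hash`, the counter never clones cgroup names or other payload data.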

## CpusetSpec

`CpusetSpec` computes a cpuset from the topology at runtime. The enum
is `#[non_exhaustive]`, so external callers should construct values via
the associated constructor functions (listed below the snippet) rather
than naming variant literals. That way a future field addition (e.g.
a stride on `Range`) can land behind a defaulted parameter without
breaking call sites. External `match` expressions over `CpusetSpec`
must also include a wildcard arm (`_ => ...`):

```rust,ignore
pub enum CpusetSpec {
    Llc(usize),                          // All CPUs in an LLC
    Numa(usize),                         // All CPUs in a NUMA node
    Range { start_frac: f64, end_frac: f64 }, // Fraction of usable CPUs
    Disjoint { index: usize, of: usize },     // Equal disjoint partitions
    Overlap { index: usize, of: usize, frac: f64 }, // Overlapping partitions
    Exact(BTreeSet<usize>),              // Exact CPU set
}
```

Convenience constructors accept parameters directly:
`CpusetSpec::disjoint(0, 2)`, `CpusetSpec::range(0.0, 0.5)`,
`CpusetSpec::exact([0, 1, 2])`, `CpusetSpec::llc(0)`,
`CpusetSpec::numa(0)`, `CpusetSpec::overlap(0, 2, 0.5)`.

All fractional specs operate on
[`usable_cpus()`](topology.md#topology-queries).
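
To make the fractional specs concrete, here is one way `Range` and `Disjoint` could map onto a list of usable CPUs. This is a sketch under stated assumptions, not ktstr's resolver: `resolve_range` and `resolve_disjoint` are hypothetical names, the rounding (floor) is assumed, and giving leftover CPUs to the last partition is also an assumption.

```rust
use std::collections::BTreeSet;

/// Hypothetical resolver: the CPUs in [start_frac, end_frac) of `usable`.
/// Assumption: both bounds are rounded down to a whole CPU index.
fn resolve_range(usable: &[usize], start_frac: f64, end_frac: f64) -> BTreeSet<usize> {
    let n = usable.len() as f64;
    let start = (start_frac * n).floor() as usize;
    let end = ((end_frac * n).floor() as usize).min(usable.len());
    // Clamp so an inverted range yields an empty set instead of panicking.
    usable[start.min(end)..end].iter().copied().collect()
}

/// Hypothetical resolver: partition `index` out of `of` non-overlapping slices.
/// Assumption: any remainder CPUs go to the last partition.
fn resolve_disjoint(usable: &[usize], index: usize, of: usize) -> BTreeSet<usize> {
    assert!(of > 0 && index < of, "index must be in 0..of");
    let chunk = usable.len() / of;
    let start = index * chunk;
    let end = if index + 1 == of { usable.len() } else { start + chunk };
    usable[start..end].iter().copied().collect()
}
```

Both helpers index into the usable-CPU list rather than assuming CPU ids are contiguous, so a gap in the ids does not shift the partitions.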

## CgroupDef

`CgroupDef` bundles three ops that always go together: create cgroup,
set cpuset, spawn workers. It is the primary way to define cgroups in
ops-based scenarios.

```rust,ignore
let def = CgroupDef::named("cg_0")
    .with_cpuset(CpusetSpec::disjoint(0, 2))
    .workers(4)
    .work_type(WorkType::SpinWait);
```

### Builder methods

- `.with_cpuset(CpusetSpec)` -- set the cpuset (CPU set the cgroup is
  pinned to).
- `.with_cpuset_mems(BTreeSet<usize>)` -- explicit `cpuset.mems`
  override (default derives from the resolved cpuset's NUMA nodes).
- `.workers(n)` -- set worker count.
- `.work_type(WorkType)` -- set work type (default: `SpinWait`).
- `.sched_policy(SchedPolicy)` -- set Linux scheduling policy
  (default: `Normal`). See [WorkSpec Types](work-types.md#scheduling-policies).
- `.work(WorkSpec)` -- add a work group (multiple calls for concurrent groups).
- `.workload(&'static Payload)` -- attach a binary workload payload
  to run alongside the worker group; the framework launches it as a
  child process inside the cgroup. **Panics** when called with a
  scheduler-kind `Payload` (`PayloadKind::Scheduler(_)`); the
  scheduler slot is `#[ktstr_test(scheduler = ...)]` at the test
  level, not the cgroup-level `workload` slot. Step-level
  `Op::RunPayload` rejects scheduler-kind payloads with an
  `anyhow::Error` instead of panicking; the build-time `workload`
  call panics because there is no scenario-level recovery path.
- `.affinity(AffinityIntent)` -- set per-worker affinity (default: `Inherit`).
- `.mem_policy(MemPolicy)` -- set NUMA memory placement policy
  (default: `Default`). See [MemPolicy](mem-policy.md).
- `.mpol_flags(MpolFlags)` -- set mode flags for `set_mempolicy(2)`
  (default: `NONE`). See [MemPolicy](mem-policy.md#mpolflags).
- `.nice(n)` -- cgroup-level default per-worker nice value, merged
  into every WorkSpec whose own `nice` is unset. See
  [Tutorial: Step 11](../tutorial.md#step-11-name-and-prioritize-workers).
- `.comm(name)` -- cgroup-level default per-worker `task->comm` via
  `prctl(PR_SET_NAME)`. Merged into every WorkSpec whose own `comm`
  is unset.
- `.pcomm(name)` -- thread-group-leader `task->comm` for the
  fork-then-thread spawn path (workers run as threads under one
  forked leader). Stamps every existing WorkSpec in-place; not
  order-independent with `.work(...)`.
- `.uid(uid)` / `.gid(gid)` -- cgroup-level default per-worker
  effective UID / GID via `setresuid` / `setresgid`. Merged into
  every WorkSpec whose own `uid` / `gid` is unset.
- `.numa_node(node)` -- cgroup-level default NUMA-node affinity for
  every WorkSpec. Merged at apply-setup time.
- `.swappable(bool)` -- opt into gauntlet work type override.
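
The "merged into every WorkSpec whose own field is unset" rule used by `.nice`, `.comm`, `.uid`, and `.gid` amounts to an `Option` fallback. The sketch below models it with two fields; `WorkSpec`, `CgroupDefaults`, and `merge_defaults` are local stand-ins for illustration, not the ktstr definitions.

```rust
/// Local stand-in with two of the mergeable fields; not the ktstr `WorkSpec`.
#[derive(Debug, Clone, Default, PartialEq)]
struct WorkSpec {
    nice: Option<i32>,
    comm: Option<String>,
}

/// Cgroup-level defaults, as `.nice(n)` / `.comm(name)` might record them.
#[derive(Debug, Clone, Default)]
struct CgroupDefaults {
    nice: Option<i32>,
    comm: Option<String>,
}

/// Fill only the fields the WorkSpec left unset; explicit values win.
fn merge_defaults(spec: &mut WorkSpec, defaults: &CgroupDefaults) {
    spec.nice = spec.nice.or(defaults.nice);
    if spec.comm.is_none() {
        spec.comm = defaults.comm.clone();
    }
}
```

A WorkSpec that sets its own `nice` keeps it, while an unset `comm` on the same spec still picks up the cgroup-level default.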

#### Cgroup controllers

The cgroup-v2 cpu / memory / io / pids controllers are exposed as
typed setters (default: unconstrained):

- `.cpu_quota_pct(pct)` / `.cpu_quota(quota, period)` /
  `.cpu_unlimited()` -- write `cpu.max` (`pct` is shorthand: `100`
  = one full CPU). `cpu_unlimited` resets to the kernel default.
- `.cpu_weight(weight)` -- write `cpu.weight` (`1..=10000`,
  default `100`).
- `.memory_max(bytes)` / `.memory_high(bytes)` /
  `.memory_low(bytes)` / `.memory_unlimited()` -- write
  `memory.max` / `memory.high` / `memory.low`. `memory_unlimited`
  resets `memory.max` to `max`.
- `.memory_swap_max(bytes)` / `.memory_swap_unlimited()` -- write
  `memory.swap.max`.
- `.io_weight(weight)` -- write `io.weight` (`1..=10000`,
  default `100`).
- `.pids_max(n)` / `.pids_unlimited()` -- write `pids.max`.
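
For readers unfamiliar with the `cpu.max` file format, the sketch below shows how a percentage could translate into the `<quota> <period>` line the kernel expects. The helper names are hypothetical, and the 100000 microsecond period is the kernel default, assumed here rather than taken from ktstr.

```rust
/// Kernel-default cgroup-v2 CPU period in microseconds (assumed for this sketch).
const DEFAULT_PERIOD_US: u64 = 100_000;

/// Render a `cpu.max` line: "<quota> <period>", or "max <period>" for no limit.
fn cpu_max_line(quota_us: Option<u64>, period_us: u64) -> String {
    match quota_us {
        Some(q) => format!("{q} {period_us}"),
        None => format!("max {period_us}"),
    }
}

/// Percent shorthand: 100 means one full CPU per period.
fn quota_from_pct(pct: u64, period_us: u64) -> u64 {
    pct * period_us / 100
}
```

Under these assumptions a 250% quota becomes `250000 100000`, and the unlimited case is the kernel default `max 100000`.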

### MemPolicy-cpuset validation

When a cgroup has a cpuset, ktstr validates that the `MemPolicy`'s
node set is covered by the NUMA nodes reachable from that cpuset. A
`MemPolicy::Bind([1])` on a cgroup whose cpuset covers only NUMA
node 0 fails at setup time. Policies without a node set (`Default`,
`Local`) skip validation.
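
The coverage check described above is a subset test between the policy's node set and the nodes reachable from the cpuset. A minimal sketch, assuming a plain CPU-to-node map; `reachable_nodes` and `validate` are hypothetical names, not ktstr functions:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// NUMA nodes reachable from a cpuset, given a CPU -> node map.
fn reachable_nodes(
    cpuset: &BTreeSet<usize>,
    cpu_to_node: &BTreeMap<usize, usize>,
) -> BTreeSet<usize> {
    cpuset
        .iter()
        .filter_map(|cpu| cpu_to_node.get(cpu).copied())
        .collect()
}

/// `policy_nodes` is `None` for policies without a node set (validation skipped).
fn validate(
    policy_nodes: Option<&BTreeSet<usize>>,
    cpuset: &BTreeSet<usize>,
    cpu_to_node: &BTreeMap<usize, usize>,
) -> Result<(), String> {
    let Some(nodes) = policy_nodes else {
        return Ok(());
    };
    let reachable = reachable_nodes(cpuset, cpu_to_node);
    if nodes.is_subset(&reachable) {
        Ok(())
    } else {
        Err(format!(
            "policy nodes {nodes:?} not covered by cpuset nodes {reachable:?}"
        ))
    }
}
```

On a two-node machine with CPUs 0 to 3 on node 0, a cpuset of those four CPUs accepts a bind to node 0 and rejects a bind to node 1.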

### WorkSpec type overrides and swappable

`CgroupDef` has a `swappable` flag (default: `false`). When `true`
and a work type override is active (`Ctx.work_type_override`), the
override replaces this def's work type.

In contrast, the `Scenario`-level override (in `run_scenario()`) only
replaces `SpinWait` work types. The two mechanisms serve different
scopes:

- **Scenario-level**: replaces `SpinWait` in `WorkSpec.work_type`
- **CgroupDef-level**: replaces the work type when `swappable = true`

Both mechanisms skip an override to a grouped work type when
`num_workers` is not divisible by that work type's group size.

Work type overrides apply only to `CgroupDef` setup, not to raw
`Op::Spawn`. `Op::Spawn` always uses the work type as given. Use
`CgroupDef` with `.swappable(true)` when the work type should
participate in gauntlet overrides.
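
The two override scopes and the shared divisibility rule can be summarized as a pair of small decision functions. This is a sketch only: the `WorkType` enum here is a local stand-in in which `Grouped` represents any work type that needs workers in whole groups, and `fits`, `scenario_override`, and `cgroup_override` are hypothetical names.

```rust
/// Local stand-in; `Grouped` represents any work type that needs whole groups.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WorkType {
    SpinWait,
    Other,
    Grouped { group_size: usize },
}

/// True when `num_workers` can host `wt` (grouped types need whole groups).
fn fits(wt: WorkType, num_workers: usize) -> bool {
    match wt {
        WorkType::Grouped { group_size } => group_size > 0 && num_workers % group_size == 0,
        _ => true,
    }
}

/// Scenario-level rule: only `SpinWait` is replaced.
fn scenario_override(current: WorkType, ov: WorkType, num_workers: usize) -> WorkType {
    if current == WorkType::SpinWait && fits(ov, num_workers) {
        ov
    } else {
        current
    }
}

/// CgroupDef-level rule: any work type is replaced, but only when `swappable`.
fn cgroup_override(current: WorkType, ov: WorkType, swappable: bool, num_workers: usize) -> WorkType {
    if swappable && fits(ov, num_workers) {
        ov
    } else {
        current
    }
}
```

In both functions a skipped override leaves the original work type untouched rather than failing the scenario, which is an assumption of this model.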

## Step

A `Step` bundles a sequence of ops, an optional `CgroupDef` setup, and
a hold period:

```rust,ignore
pub struct Step {
    pub setup: Setup,   // CgroupDefs to create after ops
    pub ops: Vec<Op>,   // Operations to apply
    pub hold: HoldSpec, // How long to wait after
}
```

`Setup` is either `Defs(Vec<CgroupDef>)` or `Factory(fn(&Ctx) -> Vec<CgroupDef>)`.
`Vec<CgroupDef>` implements `Into<Setup>`, so you can write
`setup: vec![...].into()` instead of `setup: Setup::Defs(vec![...])`.
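
The `Into<Setup>` conversion follows the usual Rust pattern of implementing `From` on the target type. The sketch below reproduces that pattern with local stand-in types; `Ctx`, `CgroupDef`, and the `resolve` helper are simplified or hypothetical and do not mirror the ktstr definitions.

```rust
/// Local stand-ins; not the ktstr types.
struct Ctx;

#[derive(Debug, Clone, PartialEq)]
struct CgroupDef {
    name: String,
}

enum Setup {
    Defs(Vec<CgroupDef>),
    Factory(fn(&Ctx) -> Vec<CgroupDef>),
}

// Implementing `From` provides the matching `Into` for free.
impl From<Vec<CgroupDef>> for Setup {
    fn from(defs: Vec<CgroupDef>) -> Self {
        Setup::Defs(defs)
    }
}

impl Setup {
    /// Hypothetical helper: produce the concrete defs, calling the factory if present.
    fn resolve(&self, ctx: &Ctx) -> Vec<CgroupDef> {
        match self {
            Setup::Defs(defs) => defs.clone(),
            Setup::Factory(make) => make(ctx),
        }
    }
}
```

A `Factory` defers the decision until a `Ctx` is available, which is useful when the defs depend on the topology discovered at runtime.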

### Constructors

**`Step::new(ops, hold)`** -- creates a step with ops only (no
CgroupDef setup). Use when the step only applies dynamic operations
to an existing topology.

**`Step::with_defs(defs, hold)`** -- creates a step with CgroupDef
setup and a hold period. The primary constructor for steps that
create cgroups with workers.

**`Step::set_ops(self, ops)`** -- REPLACES the ops on a step
(builder method). Chain after `with_defs` to add dynamic operations
to a step that also creates cgroups.

> **Naming asymmetry:** `Step::set_ops` REPLACES; the sibling
> `Backdrop::with_ops` APPENDS. The two methods deliberately use
> different verbs to signal the different semantics. A
> `Step::new(ops, hold).set_ops(more)` chain produces a step whose ops
> vec is exactly `more` (the original `ops` is dropped); a
> `Backdrop::new().with_ops(ops_a).with_ops(ops_b)` chain
> produces a backdrop whose ops vec is `ops_a + ops_b`. If you
> need to extend a step's ops vec, build the combined `Vec<Op>`
> at the call site and pass it to `set_ops`, or compose at the
> `Backdrop` layer instead.
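
The replace-versus-append difference is easy to see in a toy model. The types below are local stand-ins that keep only the ops vector (the hold argument of `Step::new` is omitted for brevity), so this is a sketch of the semantics rather than the ktstr implementation.

```rust
/// Local stand-ins modelling only the ops vector.
#[derive(Debug, Clone, PartialEq)]
struct Op(&'static str);

#[derive(Debug, Default)]
struct Step {
    ops: Vec<Op>,
}

#[derive(Debug, Default)]
struct Backdrop {
    ops: Vec<Op>,
}

impl Step {
    fn new(ops: Vec<Op>) -> Self {
        Step { ops }
    }

    /// Replaces: the previous ops are dropped.
    fn set_ops(mut self, ops: Vec<Op>) -> Self {
        self.ops = ops;
        self
    }
}

impl Backdrop {
    fn new() -> Self {
        Backdrop::default()
    }

    /// Appends: the previous ops are kept.
    fn with_ops(mut self, ops: Vec<Op>) -> Self {
        self.ops.extend(ops);
        self
    }
}
```

To extend a step in this model, concatenate the vectors first (for example with `a.into_iter().chain(b).collect()`) and pass the result to `set_ops`.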

## HoldSpec

`HoldSpec` sets how long to hold after a step completes:

| Variant | Description |
|---|---|
| `Frac(f64)` | Fraction of the total scenario duration |
| `Fixed(Duration)` | Fixed time |
| `Loop { interval }` | Repeat ops at interval until time runs out |

`HoldSpec::FULL` is a constant for `Frac(1.0)` (hold for the full
scenario duration).
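
As a worked example of the two fixed-length variants, the sketch below turns a `HoldSpec` into a concrete `Duration` for a given scenario length. The enum is a local stand-in that omits `Loop`, and `resolve` is a hypothetical helper, not a ktstr method.

```rust
use std::time::Duration;

/// Local stand-in covering the two fixed-length variants; `Loop` is omitted.
#[derive(Debug, Clone, Copy, PartialEq)]
enum HoldSpec {
    Frac(f64),
    Fixed(Duration),
}

impl HoldSpec {
    /// Hold for the full scenario duration.
    const FULL: HoldSpec = HoldSpec::Frac(1.0);

    /// Hypothetical helper: hold time for a scenario lasting `total`.
    fn resolve(self, total: Duration) -> Duration {
        match self {
            HoldSpec::Frac(f) => total.mul_f64(f),
            HoldSpec::Fixed(d) => d,
        }
    }
}
```

For a 20 second scenario, `Frac(0.25)` resolves to 5 seconds, while `Fixed` ignores the scenario length entirely.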

## execute_defs

`execute_defs(ctx, defs)` is a convenience wrapper for the common
pattern of creating cgroups and running them for the full duration:

```rust,ignore
execute_defs(ctx, vec![
    CgroupDef::named("cg_0").workers(4),
    CgroupDef::named("cg_1").workers(4),
])
```

Equivalent to `execute_steps(ctx, vec![Step::with_defs(defs, HoldSpec::FULL)])`.

## execute_steps

`execute_steps(ctx, steps)` runs a step sequence:

1. For each step: apply ops, apply setup (create cgroups from
   `CgroupDef`s), then hold for the specified duration. Ops run first so
   parent cgroups can be created before children are spawned.
   `Loop` steps reverse this: setup runs once before the loop, then
   ops repeat at the specified interval.
2. Check scheduler liveness between steps.
3. After all steps: collect worker reports and run checks.
4. Write stimulus events to the SHM ring buffer for timeline analysis.
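
The per-step ordering in item 1 can be traced with a toy model that records events instead of touching cgroups. Everything here is hypothetical (`Event`, `Hold`, `run_step`), and the loop variant takes an explicit iteration count in place of a real clock.

```rust
/// What a step did, in order. Local stand-in for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    Ops,
    Setup,
    Hold,
}

/// `Timed` covers the non-loop hold variants; `Loop` repeats ops.
enum Hold {
    Timed,
    Loop { iterations: usize },
}

/// Record the order in which one step applies ops, setup, and holds.
fn run_step(hold: &Hold) -> Vec<Event> {
    let mut log = Vec::new();
    match hold {
        // Normal step: ops first, then setup, then a single hold.
        Hold::Timed => {
            log.push(Event::Ops);
            log.push(Event::Setup);
            log.push(Event::Hold);
        }
        // Loop step: setup once, then ops repeat with an interval wait between runs.
        Hold::Loop { iterations } => {
            log.push(Event::Setup);
            for _ in 0..*iterations {
                log.push(Event::Ops);
                log.push(Event::Hold);
            }
        }
    }
    log
}
```

The model makes the reversal visible: `Setup` is the second event of a timed step but the first event of a loop step.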

## execute_steps_with

`execute_steps_with(ctx, steps, assertions)` is the same as
`execute_steps` but accepts an explicit
[`Assert`](checking.md#assert-struct) for worker checks.
`execute_steps` is a convenience wrapper that passes `None`.

```rust,ignore
use ktstr::prelude::*;

fn my_scenario(ctx: &Ctx) -> Result<AssertResult> {
    let assertions = Assert::NO_OVERRIDES
        .check_not_starved()
        .max_gap_ms(3000);

    let steps = vec![/* ... */];
    execute_steps_with(ctx, steps, Some(&assertions))
}
```

When `assertions` is `Some`, the provided `Assert` overrides `ctx.assert`
for worker checks. When it is `None`, worker checks use `ctx.assert` (the
merged three-layer config: `default_checks` -> scheduler -> per-test).
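
One plausible reading of the three-layer merge is an overlay in which later layers win field by field. The sketch below models that reading with a two-field stand-in; the `Assert` struct, `merge`, and `merged` are hypothetical, and the "later layer wins" rule is an assumption rather than documented behavior.

```rust
/// Local stand-in with two checks; `None` means "no opinion at this layer".
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Assert {
    not_starved: Option<bool>,
    max_gap_ms: Option<u64>,
}

impl Assert {
    /// Overlay `upper` on `self`: fields set in `upper` win.
    fn merge(self, upper: Assert) -> Assert {
        Assert {
            not_starved: upper.not_starved.or(self.not_starved),
            max_gap_ms: upper.max_gap_ms.or(self.max_gap_ms),
        }
    }
}

/// default_checks -> scheduler -> per-test, lowest priority first.
fn merged(default_checks: Assert, scheduler: Assert, per_test: Assert) -> Assert {
    default_checks.merge(scheduler).merge(per_test)
}
```

A layer that leaves every field as `None` (as a constant like `Assert::NO_OVERRIDES` suggests) passes the lower layers through unchanged in this model.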