ktstr 0.5.2

Test harness for Linux process schedulers
# Snapshots

A **snapshot** is a frozen record of guest BPF map state and scheduler
globals captured at a specific point in a scenario. The freeze
coordinator pauses every vCPU long enough to walk the kernel's BPF
maps, BTF-render every captured value, and bundle the result into a
`FailureDumpReport` keyed by a name you choose. Test code then reads
it back via the [`Snapshot`](#reading-the-captured-report) accessor for typed traversal.

`Op::snapshot("name")` is the **on-demand** capture trigger. Use it to
ask "what does the scheduler look like *right now*?" at a precise
point in the scenario. For automatic capture on a kernel write to a
specific symbol, see [Watch Snapshots](watch-snapshots.md). For
**cadenced** capture across the workload window without invoking
`Op::snapshot` from the scenario body, see
[Periodic Capture](periodic-capture.md) — it produces a time-ordered
[`SampleSeries`](temporal-assertions.md#sampleseries) that flows into
the [temporal-assertion](temporal-assertions.md) patterns
(`nondecreasing`, `rate_within`, `steady_within`, `converges_to`,
`always_true`, `ratio_within`).

## Issuing a snapshot

`Op::snapshot(name)` is a single op in a [`Step`](../concepts/ops.md)'s op list. The
executor invokes the active [`SnapshotBridge`](#wiring-the-bridge)'s capture callback,
which performs the freeze rendezvous and returns the report; the
bridge stores the report under `name`.

```rust,ignore
use ktstr::prelude::*;

let steps = vec![Step {
    setup: vec![CgroupDef::named("workers").workers(2)].into(),
    ops: vec![
        Op::snapshot("after_spawn"),
        // ... other ops ...
        Op::snapshot("after_workload"),
    ],
    hold: HoldSpec::FULL,
}];
execute_steps(ctx, steps)?;
```

A scenario may issue any number of `Op::snapshot` ops with distinct
names. Reusing a name overwrites the prior capture (and emits a
`tracing::warn!`).

## Wiring the bridge

The bridge is what turns an `Op::snapshot` into stored data. The host
typically wires it before `execute_steps` runs, but a scenario can
install one inline:

```rust,ignore
use ktstr::prelude::*;

let cb: CaptureCallback = std::sync::Arc::new(|_name: &str| {
    // Production: freeze the VM and build a real FailureDumpReport.
    // Tests: return a hand-crafted report so the executor + bridge
    // pipeline runs without booting a guest.
    Some(FailureDumpReport::default())
});
let bridge = SnapshotBridge::new(cb);
let bridge_handle = bridge.clone();
let _guard = bridge.set_thread_local();

execute_steps(ctx, steps)?;

let captured = bridge_handle.drain();
let report = captured.get("after_spawn").expect("snapshot recorded");
```

`set_thread_local` returns a [`BridgeGuard`](#wiring-the-bridge) that restores the prior
bridge on drop, so a nested scenario inside an outer one cannot leak
its bridge into the outer scope. Bind the guard to an
underscore-prefixed identifier such as `_guard` so the binding lives
for the scope of the scenario — a bare `let _ = bridge.set_thread_local()`
drops the guard immediately and clears the bridge before any op runs.
The guard is marked `#[must_use]`, so the compiler warns if the return
value is discarded entirely.
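The restore-on-drop behavior can be illustrated without ktstr. The following is a minimal sketch, not the crate's implementation: `ACTIVE`, `Guard`, `install`, and `active` are hypothetical stand-ins for the thread-local bridge slot, `BridgeGuard`, `set_thread_local`, and the executor's lookup.

```rust
use std::cell::RefCell;

thread_local! {
    // Hypothetical stand-in for the thread-local bridge slot.
    static ACTIVE: RefCell<Option<&'static str>> = RefCell::new(None);
}

/// Restores the previously installed value when dropped.
#[must_use = "dropping the guard uninstalls the value immediately"]
struct Guard {
    prior: Option<&'static str>,
}

impl Drop for Guard {
    fn drop(&mut self) {
        let prior = self.prior.take();
        ACTIVE.with(|slot| *slot.borrow_mut() = prior);
    }
}

/// Installs `name` for the current thread and returns the restoring guard.
fn install(name: &'static str) -> Guard {
    let prior = ACTIVE.with(|slot| slot.borrow_mut().replace(name));
    Guard { prior }
}

/// Reads the currently installed value, if any.
fn active() -> Option<&'static str> {
    ACTIVE.with(|slot| *slot.borrow())
}

/// Nested installs unwind in order: the inner guard restores the outer value.
fn nested_scopes() -> Vec<Option<&'static str>> {
    let mut seen = Vec::new();
    let _outer = install("outer");
    {
        let _inner = install("inner");
        seen.push(active());
    }
    seen.push(active());
    seen
}

/// `let _ = ...` drops the guard at once, so nothing stays installed.
fn discarded_guard() -> Option<&'static str> {
    let _ = install("lost");
    active()
}
```

In this model, `nested_scopes` observes the inner value and then the outer one again, while `discarded_guard` observes nothing installed, which mirrors the `_guard` versus `_` distinction described above.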

If no bridge is installed, `Op::snapshot` is a no-op with a
`tracing::warn!` and the scenario continues. If the capture callback
returns `None` (capture pipeline unavailable), the bridge stays empty
and the scenario continues. Existing scenarios that never declare
snapshot ops keep working unchanged.
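The storage rules described in this section (a clone shares the store, a reused name overwrites, a `None` callback stores nothing) can be modeled in a few lines. This is a sketch under stated assumptions: `Bridge`, `Report`, and `Callback` are hypothetical types, and the shared `Arc<Mutex<HashMap>>` is one plausible way to get the clone-and-drain behavior shown above, not a description of `SnapshotBridge` internals.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for FailureDumpReport.
type Report = String;
// Hypothetical stand-in for CaptureCallback.
type Callback = Arc<dyn Fn(&str) -> Option<Report> + Send + Sync>;

#[derive(Clone)]
struct Bridge {
    callback: Callback,
    // Shared between clones, so a handle cloned before the run sees every capture.
    store: Arc<Mutex<HashMap<String, Report>>>,
}

impl Bridge {
    fn new(callback: Callback) -> Self {
        Bridge {
            callback,
            store: Arc::new(Mutex::new(HashMap::new())),
        }
    }

    /// Runs the callback and stores its report; returns true when one was stored.
    fn capture(&self, name: &str) -> bool {
        match (self.callback)(name) {
            Some(report) => {
                // insert() replaces any earlier report stored under the same name.
                self.store.lock().unwrap().insert(name.to_string(), report);
                true
            }
            // Capture pipeline unavailable: the store is left untouched.
            None => false,
        }
    }

    /// Takes every stored report, leaving the store empty.
    fn drain(&self) -> HashMap<String, Report> {
        std::mem::take(&mut *self.store.lock().unwrap())
    }
}
```

Draining through a clone rather than the original keeps the original free to be consumed by whatever installs it, which is the shape the `bridge_handle` example above relies on.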

## Reading the captured report

[`Snapshot::new(report)`](#reading-the-captured-report) builds a borrowed view over a
`FailureDumpReport`. The view does not copy the report; accessor
methods walk the report in place and return further borrowed views.

### Map-name lookup

```rust,ignore
let snap = Snapshot::new(report);
let map = snap.map("scx_per_task")?;        // SnapshotMap
```

`Snapshot::map(name)` returns `Result<SnapshotMap, SnapshotError>`. A
miss yields `SnapshotError::MapNotFound { requested, available }` —
the `available` list enumerates every captured map name so a typo
surfaces in test output.

### Top-level globals (.bss / .data / .rodata)

```rust,ignore
let nr_cpus = snap.var("nr_cpus_onln").as_u64()?;
```

`Snapshot::var(name)` walks every `*.bss`, `*.data`, and `*.rodata`
global-section map for a top-level member named `name` and returns
the unique match as a [`SnapshotField`](#terminal-accessors).
Multiple matches yield
`SnapshotError::AmbiguousVar { requested, found_in }` —
disambiguate via `Snapshot::map(name)`. A miss yields
`SnapshotError::VarNotFound { requested, available }` with the
union of every section's top-level member names.
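The unique-match rule can be sketched as a small standalone function. The names here (`VarError`, `resolve_var`, the `(map name, member names)` input shape) are illustrative assumptions and only model the three outcomes described above; they are not ktstr's types.

```rust
#[derive(Debug, PartialEq)]
enum VarError {
    NotFound { requested: String, available: Vec<String> },
    Ambiguous { requested: String, found_in: Vec<String> },
}

/// Returns the name of the single section map that exposes `requested`.
/// `sections` pairs each global-section map name with its top-level members.
fn resolve_var(sections: &[(&str, Vec<&str>)], requested: &str) -> Result<String, VarError> {
    // Collect every section that exposes the name, in capture order.
    let found_in: Vec<String> = sections
        .iter()
        .filter(|(_, members)| members.contains(&requested))
        .map(|(map, _)| map.to_string())
        .collect();

    match found_in.len() {
        0 => {
            // Miss: report the union of all member names to expose typos.
            let mut available: Vec<String> = sections
                .iter()
                .flat_map(|(_, members)| members.iter().map(|m| m.to_string()))
                .collect();
            available.sort();
            available.dedup();
            Err(VarError::NotFound { requested: requested.to_string(), available })
        }
        1 => Ok(found_in.into_iter().next().unwrap()),
        _ => Err(VarError::Ambiguous { requested: requested.to_string(), found_in }),
    }
}
```

Sorting and deduplicating the `available` list is a choice made for this sketch; the ordering ktstr actually reports is not specified here.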

### Entries inside a map

```rust,ignore
let map = snap.map("scx_per_task")?;
let first = map.at(0);                          // by ordinal index
let busy = map.find(|e| e.get("tid").as_i64().unwrap_or(-1) == 1234);
let busiest = map.max_by(|e| e.get("runtime_ns").as_u64().unwrap_or(0));
let all_active = map.filter(|e| e.get("runtime_ns").as_u64().unwrap_or(0) > 0);
```

`SnapshotMap` exposes:

- `at(n)` — entry at ordinal index `n`. Out of range returns
  `SnapshotEntry::Missing(SnapshotError::IndexOutOfRange)`.
- `find(predicate)` — first matching entry. No match returns
  `SnapshotEntry::Missing(SnapshotError::NoMatch { op: "find", ... })`.
- `filter(predicate)` — every matching entry collected into a `Vec`.
- `max_by(key_fn)` — entry whose `key_fn` produces the maximum `u64`.
  Empty map returns `Missing` with `op: "max_by"`.

### Per-CPU maps

`BPF_MAP_TYPE_PERCPU_ARRAY` / `_PERCPU_HASH` / `_LRU_PERCPU_HASH` maps
require narrowing to a CPU before reading individual values:

```rust,ignore
let map = snap.map("scx_pcpu")?;
let entry = map.cpu(1).at(0);                    // CPU 1's slot
let value = entry.get("").as_u64()?;             // empty path = root
```

`SnapshotMap::cpu(n)` narrows subsequent `at` / `find` calls to a
specific CPU's slot. An out-of-range CPU returns `Missing` with
`SnapshotError::PerCpuSlot { unmapped: false, len, ... }`; an
unmapped slot (`None` in the per-CPU vec) returns the same error
variant with `unmapped: true`.

Calling `entry.get(path)` on a per-CPU entry **without** narrowing
first surfaces `SnapshotError::PerCpuNotNarrowed { map }` — call
`.cpu(N)` first.
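The three failure modes (not narrowed, out-of-range CPU, unmapped slot) can be modeled over a plain slice of optional values. This is a sketch only: `PerCpuEntry` and `PerCpuError` are hypothetical names, and a per-CPU slot is reduced to an `Option<u64>` for brevity.

```rust
#[derive(Debug, PartialEq)]
enum PerCpuError {
    NotNarrowed,
    Slot { cpu: usize, len: usize, unmapped: bool },
}

struct PerCpuEntry<'a> {
    // One slot per CPU; `None` models an unmapped slot.
    slots: &'a [Option<u64>],
    cpu: Option<usize>,
}

impl<'a> PerCpuEntry<'a> {
    fn new(slots: &'a [Option<u64>]) -> Self {
        PerCpuEntry { slots, cpu: None }
    }

    /// Narrows later reads to CPU `n`.
    fn cpu(mut self, n: usize) -> Self {
        self.cpu = Some(n);
        self
    }

    /// Reads the narrowed slot, or explains why it cannot be read.
    fn read(&self) -> Result<u64, PerCpuError> {
        let cpu = self.cpu.ok_or(PerCpuError::NotNarrowed)?;
        match self.slots.get(cpu) {
            // Index past the end of the per-CPU vector.
            None => Err(PerCpuError::Slot { cpu, len: self.slots.len(), unmapped: false }),
            // Index in range, but the slot holds no value.
            Some(None) => Err(PerCpuError::Slot { cpu, len: self.slots.len(), unmapped: true }),
            Some(Some(value)) => Ok(*value),
        }
    }
}
```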

## Field accessors and dotted paths

`SnapshotEntry::get(path)` and `SnapshotField::get(path)` walk the
entry's value side along a dotted path. Each component matches a
struct member; pointer dereferences are followed transparently.

```rust,ignore
let weight = entry.get("ctx.weight").as_u64()?;
let policy = entry.get("ctx.policy").as_str()?;     // enum variant name
let pid    = entry.get("leader.pid").as_i64()?;     // pointer chase
```

The dotted-path walker:

1. **Pointer chase.** When a path step lands on
   `RenderedValue::Ptr { deref: Some(...) }`, the walker
   transparently follows the dereference (up to 16 hops) before
   matching the next component. The test author writes the path the
   BTF would suggest; pointer indirection is invisible.

2. **Empty path.** `get("")` returns the current value as a
   `SnapshotField::Value` — useful for terminal accessors on per-CPU
   slots that hold a scalar directly.

3. **Composability.** Two-segment paths are equivalent to chained
   `get` calls: `entry.get("ctx.weight")` ≡ `entry.get("ctx").get("weight")`.

   Note that [`Snapshot::var`](#top-level-globals-bss--data--rodata) does **not** split — it treats the full
   string as one global name. To walk into a struct, use
   `snap.var("ctx").get("weight")`.
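The walker rules above can be sketched over a toy value tree. Everything in this block is an illustrative assumption: `Value`, `WalkError`, `chase`, and `get` are simplified stand-ins for `RenderedValue`, `SnapshotError`, and the real accessor, and only the 16-hop limit is taken from the text.

```rust
#[derive(Debug, PartialEq)]
enum Value {
    Uint(u64),
    Struct(Vec<(&'static str, Value)>),
    Ptr { deref: Option<Box<Value>> },
}

#[derive(Debug, PartialEq)]
enum WalkError {
    EmptyPathComponent,
    FieldNotFound { walked: String, component: String },
    NotAStruct { walked: String, component: String },
}

const MAX_HOPS: usize = 16;

/// Follows resolved pointers, giving up after MAX_HOPS dereferences.
fn chase(mut value: &Value) -> &Value {
    for _ in 0..MAX_HOPS {
        match value {
            Value::Ptr { deref: Some(inner) } => value = &**inner,
            _ => break,
        }
    }
    value
}

/// Walks `path` from `root`; an empty path returns `root` itself.
fn get<'a>(root: &'a Value, path: &str) -> Result<&'a Value, WalkError> {
    let mut current = root;
    if path.is_empty() {
        return Ok(current);
    }
    let mut walked = String::new();
    for component in path.split('.') {
        if component.is_empty() {
            return Err(WalkError::EmptyPathComponent);
        }
        // Pointers are followed before the next component is matched.
        let members = match chase(current) {
            Value::Struct(members) => members,
            _ => {
                return Err(WalkError::NotAStruct { walked, component: component.to_string() })
            }
        };
        current = match members.iter().find(|(name, _)| *name == component) {
            Some((_, value)) => value,
            None => {
                return Err(WalkError::FieldNotFound { walked, component: component.to_string() })
            }
        };
        if !walked.is_empty() {
            walked.push('.');
        }
        walked.push_str(component);
    }
    Ok(current)
}

/// Sample entry: an inline `ctx` struct and a `leader` pointer to a struct.
fn sample() -> Value {
    let leader = Value::Struct(vec![("pid", Value::Uint(1234))]);
    Value::Struct(vec![
        ("ctx", Value::Struct(vec![("weight", Value::Uint(100))])),
        ("leader", Value::Ptr { deref: Some(Box::new(leader)) }),
    ])
}
```

With `sample()`, `get(&e, "leader.pid")` resolves through the pointer without any explicit dereference in the path, and `get(get(&e, "ctx")?, "weight")` yields the same value as `get(&e, "ctx.weight")`. The sketch returns the final value without chasing it, so a terminal read on a pointer field still sees the pointer itself.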

### Terminal accessors

`SnapshotField` exposes typed terminal reads, all returning
`Result<T, SnapshotError>`:

| Method | Returns | Accepts |
|---|---|---|
| `as_u64()` | `u64` | `Uint`, non-negative `Int`/`Enum`, `Bool` (0/1), `Char` (raw byte), `Ptr` (pointer value, including cast-recovered pointers — see [Cast-recovered pointers](#cast-recovered-pointers)), per-CPU array key |
| `as_i64()` | `i64` | `Int`, `Uint` ≤ i64::MAX, `Bool`, `Char`, `Enum`, per-CPU array key |
| `as_bool()` | `bool` | `Bool` direct; `Int`/`Uint`/`Char`/`Enum`/`Ptr` non-zero is true; per-CPU array key |
| `as_f64()` | `f64` | `Float`, `Int`, `Uint`, `Enum`, per-CPU array key |
| `as_str()` | `&str` | `Enum` with a resolved variant name |
| `rendered()` | `Option<&RenderedValue>` | the underlying value when present |

Type mismatches surface as `SnapshotError::TypeMismatch { expected,
actual, requested }` — for example, `as_str()` on a `Uint` reports
`expected: "Enum"`, `actual: "Uint"`.
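To make the coercion rules in the table concrete, here is a reduced model covering only four rendered shapes. `Rendered`, `TypeMismatch`, and the `expected`/`actual` strings are assumptions made for this sketch; the labels ktstr reports for these particular cases are not given in this page.

```rust
#[derive(Debug, PartialEq)]
enum Rendered {
    Uint(u64),
    Int(i64),
    Bool(bool),
    Float(f64),
}

#[derive(Debug, PartialEq)]
struct TypeMismatch {
    expected: &'static str,
    actual: &'static str,
}

/// Names the rendered variant for error reporting.
fn kind(value: &Rendered) -> &'static str {
    match value {
        Rendered::Uint(_) => "Uint",
        Rendered::Int(_) => "Int",
        Rendered::Bool(_) => "Bool",
        Rendered::Float(_) => "Float",
    }
}

/// Accepts Uint, non-negative Int, and Bool (as 0/1).
fn as_u64(value: &Rendered) -> Result<u64, TypeMismatch> {
    match *value {
        Rendered::Uint(n) => Ok(n),
        Rendered::Int(n) if n >= 0 => Ok(n as u64),
        Rendered::Bool(b) => Ok(u64::from(b)),
        _ => Err(TypeMismatch { expected: "u64", actual: kind(value) }),
    }
}

/// Accepts Int, Bool, and any Uint that fits in an i64.
fn as_i64(value: &Rendered) -> Result<i64, TypeMismatch> {
    match *value {
        Rendered::Int(n) => Ok(n),
        Rendered::Uint(n) if n <= i64::MAX as u64 => Ok(n as i64),
        Rendered::Bool(b) => Ok(i64::from(b)),
        _ => Err(TypeMismatch { expected: "i64", actual: kind(value) }),
    }
}
```

The point of the range guards is that a value never silently wraps: a negative `Int` read as `u64`, or a `Uint` above `i64::MAX` read as `i64`, becomes an error rather than a surprising number.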

### Cast-recovered pointers

Schedulers stash kernel pointers (`task_struct *`, `cgroup *`, …)
and arena pointers in BPF map fields whose BTF declares them as
`u64` because BTF cannot express a pointer to a per-allocation
type. The host-side
[cast analyzer](../architecture/monitor.md#cast-analysis) walks the
scheduler's `.bpf.o` instruction stream during load, recovers the
target struct for each provable `(source_struct, field_offset) →
target_struct` mapping, and feeds the result into the renderer.

When the renderer encounters a `u64` slot the analyzer flagged, it
emits a [`RenderedValue::Ptr`](#field-accessors-and-dotted-paths)
with `cast_annotation` set and chases the dereference through the
address-space-appropriate reader. The full set of
`cast_annotation` values:

| Annotation | Meaning |
|---|---|
| `"cast→arena"` | Cast analyzer flagged a `u64` field; chase resolved to an arena allocation via the BTF-typed pointee. |
| `"cast→kernel"` | Cast analyzer flagged a `u64` field; chase resolved to a kernel slab / vmalloc / per-cpu allocation. |
| `"sdt_alloc"` | BTF-typed `Type::Ptr` whose pointee was a `BTF_KIND_FWD`; the renderer recovered the real payload struct id via the `sdt_alloc` bridge. No cast-analyzer hit was involved. |
| `"cast→arena (sdt_alloc)"` | Cast analyzer flagged a `u64` field AND the chase target peeled to a Fwd; the bridge recovered the real arena payload struct id. |
| `"cast→kernel (sdt_alloc)"` | Cast analyzer flagged a `u64` field AND the chase target peeled to a Fwd; the bridge recovered the real kernel-side struct id. |

A parallel cross-BTF Fwd resolution path is consulted whenever a
chase target survives the local same-BTF Fwd resolve as a
`BTF_KIND_FWD`: when the body lives in a sibling embedded BPF
object's BTF (the multi-`.bpf.objs` shape), the renderer switches
the recursion to that sibling BTF and renders the full body.
Cross-BTF resolution does NOT add a new annotation — the body is
recovered transparently and the rendered subtree carries whichever
annotation (`"cast→arena"`, `"cast→kernel"`, or `None` for a
BTF-typed `Type::Ptr`) it would have had if the same struct lived
in the entry BTF.

From the test author's perspective:

- `as_u64()` returns the raw pointer value (matching pre-analysis
  behavior, so existing tests do not need updating).
- `entry.get("ctx.task")` and similar dotted-path walks transparently
  follow the cast-recovered chase; nested struct fields appear under
  the same path the BTF would suggest for a natively-typed pointer.
- The `cast_annotation` is visible in failure-dump rendering and
  diagnostic output so an operator can distinguish cast-recovered
  pointers from BTF-typed ones; the test API does not require any
  extra calls to consume them.

## Error handling

[`SnapshotError`](#error-handling) is the unified error type for every fallible
accessor. Each variant carries the path or available alternatives
needed to fix the call site without re-running the test:

- `MapNotFound { requested, available }` — `Snapshot::map(name)` miss.
- `VarNotFound { requested, available }` — `Snapshot::var(name)` miss.
- `AmbiguousVar { requested, found_in }` — more than one
  `*.bss`/`*.data`/`*.rodata` map exposes a top-level member with the
  requested name. `found_in` lists every map (in capture order)
  where the name was seen; disambiguate via `Snapshot::map(name)` +
  `.at(0).get(...)` against a specific map.
- `FieldNotFound { requested, walked, component, available }` — a
  path component did not match any struct member at that depth.
  `walked` is the prefix that resolved successfully; `component` is
  the failing segment; `requested` is the original user-supplied
  path.
- `NotAStruct { requested, walked, component, kind }` — a path
  component reached a non-struct value where a struct was expected
  (e.g. descending into a `Uint` leaf). `kind` names the actual
  variant.
- `TypeMismatch { expected, actual, requested }` — terminal
  accessor called on a rendered shape it cannot decode. `expected`
  names the scalar type the accessor requires; `actual` names the
  rendered variant; `requested` is the user-supplied lookup string
  (empty when the accessor was invoked on a leaf without a path
  walk).
- `IndexOutOfRange { map, index, len }` — `SnapshotMap::at(n)` past
  the entry list end.
- `PerCpuSlot { map, cpu, len, unmapped }` — out-of-range or unmapped
  per-CPU slot; `unmapped: true` distinguishes a `None` slot from an
  out-of-range CPU.
- `NoMatch { map, op }` — predicate-based lookup (`find`, `max_by`)
  found no match. `op` names the operation.
- `EmptyPathComponent { requested }` — a path string contained an
  empty component (e.g. `"a..b"`).
- `PerCpuNotNarrowed { map }` — `entry.get` called on a per-CPU entry
  without `cpu(N)` first.
- `NoRendered { map, side }` — entry has no rendered key/value side
  (BTF type id missing at capture time, leaving hex bytes only).
- `PlaceholderSample { tag, reason }` — a periodic-capture sample's
  underlying `FailureDumpReport` is a placeholder produced by the
  freeze-rendezvous timeout fallback. Surfaces when projecting via
  [`SampleSeries::bpf`](temporal-assertions.md#projecting-from-bpf-state);
  temporal patterns route the variant through their skip path so a
  placeholder never falsely registers as zero progress against a
  monotonicity / rate / steady / ratio band. `reason` carries the
  rendezvous-timeout cause text.
- `MissingStats { tag }` — a [`SampleSeries::stats`](temporal-assertions.md#projecting-from-scx_stats-json)
  projection ran on a sample whose `stats` slot is `None` (stats
  client not wired or per-sample stats request failed). Distinct
  from in-JSON path misses (`FieldNotFound` / `TypeMismatch`) so the
  assertion site can branch on the cause without re-walking the
  source.

`SnapshotError` implements `std::error::Error` and `Display`, so it
composes with `?` and `anyhow`. The `Display` impl includes the path
and any available alternatives so a failure message points the test
author at the fix.
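The pattern of an error that carries its own alternatives and still composes with `?` looks roughly like the following. This is a generic sketch using only the standard library: `LookupError`, `lookup`, and `check` are hypothetical, and the message format is invented for illustration rather than copied from `SnapshotError`'s `Display` output.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug)]
enum LookupError {
    MapNotFound { requested: String, available: Vec<String> },
}

impl fmt::Display for LookupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // The message lists the alternatives so a typo is visible at a glance.
            LookupError::MapNotFound { requested, available } => write!(
                f,
                "map '{}' not found; available: [{}]",
                requested,
                available.join(", ")
            ),
        }
    }
}

impl Error for LookupError {}

/// Finds `requested` among the captured map names.
fn lookup<'a>(maps: &[&'a str], requested: &str) -> Result<&'a str, LookupError> {
    maps.iter()
        .copied()
        .find(|name| *name == requested)
        .ok_or_else(|| LookupError::MapNotFound {
            requested: requested.to_string(),
            available: maps.iter().map(|m| m.to_string()).collect(),
        })
}

/// `?` converts the typed error into a boxed `dyn Error` for the caller.
fn check(maps: &[&str]) -> Result<usize, Box<dyn Error>> {
    let name = lookup(maps, "scx_per_tsak")?; // deliberate typo
    Ok(name.len())
}
```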

## Worked example

Capture a snapshot after spawn, drain the bridge, and record a
top-level global from the captured report:

```rust,ignore
use ktstr::prelude::*;

fn snapshot_then_inspect(ctx: &Ctx) -> Result<AssertResult> {
    // Wire a bridge for the duration of the scenario.
    let cb: CaptureCallback = std::sync::Arc::new(|_name| {
        // Production: freeze + build a real FailureDumpReport. The
        // host installs this callback in real runs.
        Some(FailureDumpReport::default())
    });
    let bridge = SnapshotBridge::new(cb);
    let handle = bridge.clone();
    let _guard = bridge.set_thread_local();

    // Run the scenario, capturing once after spawn.
    let steps = vec![Step {
        setup: vec![CgroupDef::named("workers").workers(2)].into(),
        ops: vec![Op::snapshot("after_spawn")],
        hold: HoldSpec::FULL,
    }];
    let mut result = execute_steps(ctx, steps)?;

    // Drain the bridge and inspect the captured report.
    let captured = handle.drain();
    let report = captured
        .get("after_spawn")
        .ok_or_else(|| anyhow::anyhow!("snapshot 'after_spawn' missing"))?;
    let snap = Snapshot::new(report);

    // Top-level scalar.
    if let Ok(nr_cpus) = snap.var("nr_cpus_onln").as_u64() {
        result.details.push(AssertDetail::new(
            DetailKind::Other,
            format!("captured nr_cpus_onln = {nr_cpus}"),
        ));
    }

    Ok(result)
}
```

For the executor + bridge wiring outside a VM, see the host-side
smoke tests in `tests/snapshot_e2e.rs` — they exercise the same
pipeline against a hand-crafted `FailureDumpReport` so the assertion
shape is covered without booting a guest.

## Composing reads with writes

Snapshots are the **read** half of the host↔guest interaction. The
**write** half — pre-seeding a BPF map value before the scenario
starts — is the `#[ktstr_test]` attribute `bpf_map_write = CONST`,
which targets a `BpfMapWrite` constant:

```rust,ignore
use ktstr::prelude::*;

const TRIGGER_FAULT: BpfMapWrite = BpfMapWrite {
    map_name_suffix: ".bss",   // matched against discovered maps
    offset: 42,                // byte offset within the map's value
    value: 1,                  // u32 written by the host
};

#[ktstr_test(bpf_map_write = TRIGGER_FAULT, expect_err = true)]
fn fault_then_inspect(ctx: &Ctx) -> Result<AssertResult> {
    // The host has already written `1` at `.bss + 42` before
    // the scenario started. Capture and inspect the resulting
    // scheduler state mid-run.
    /* bridge wiring + Op::snapshot + Snapshot::new as above */
    Ok(AssertResult::pass())
}
```

The write is event-driven: the host polls for BPF map
discoverability (scheduler loaded), polls the SHM ring for
scenario start, then writes the configured u32 at the configured
offset. Only `BPF_MAP_TYPE_ARRAY` maps are supported; the framework
finds the map by `map_name_suffix` (e.g. `".bss"`) via
`BpfMapAccessor::find_map`. See [Monitor → BPF map writes](../architecture/monitor.md)
for the prerequisites (vmlinux) and the full host-side
contract.
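The ordering described above (wait for the map, wait for scenario start, then write once) can be modeled on a plain byte buffer. This is a sketch under explicit assumptions: `poll_until`, `write_u32_at`, and `seed` are hypothetical helpers, the retry count of 100 is arbitrary, and the little-endian encoding is an assumption about the guest, not something this page states.

```rust
/// Polls `ready` up to `max_tries` times; true once it reports readiness.
fn poll_until(mut ready: impl FnMut() -> bool, max_tries: usize) -> bool {
    (0..max_tries).any(|_| ready())
}

/// Writes `value` as 4 bytes at `offset` (little-endian is assumed here).
fn write_u32_at(map_value: &mut [u8], offset: usize, value: u32) -> Result<(), String> {
    let end = offset
        .checked_add(4)
        .ok_or_else(|| "offset overflows usize".to_string())?;
    let len = map_value.len();
    let slot = map_value
        .get_mut(offset..end)
        .ok_or_else(|| format!("offset {} + 4 exceeds value size {}", offset, len))?;
    slot.copy_from_slice(&value.to_le_bytes());
    Ok(())
}

/// Waits for both conditions in order, then performs the one-shot write.
fn seed(
    map_value: &mut [u8],
    offset: usize,
    value: u32,
    map_discovered: impl FnMut() -> bool,
    scenario_started: impl FnMut() -> bool,
) -> Result<(), String> {
    if !poll_until(map_discovered, 100) {
        return Err("map never became discoverable".to_string());
    }
    if !poll_until(scenario_started, 100) {
        return Err("scenario never started".to_string());
    }
    write_u32_at(map_value, offset, value)
}
```

Copying bytes instead of casting to a `u32` pointer keeps the write valid at unaligned offsets such as the `42` used in the example above.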

Read+write workflows then compose naturally: the test pre-seeds
guest state with `bpf_map_write`, lets the scheduler run, and
asserts on the resulting state with `Op::snapshot` + the
[`Snapshot`](#reading-the-captured-report) accessor:

1. **Write (pre-scenario)** — `bpf_map_write` flips a `.bss` flag
   the scheduler reads.
2. **Run** — the scenario's ops drive workload behavior; the
   scheduler reacts to the flag.
3. **Read (mid-scenario)** — `Op::snapshot("after")` captures the
   scheduler state at the chosen point.
4. **Assert** — `Snapshot::var(...).as_u64()` /
   `Snapshot::map(...).find(...).get(...).as_*()` verifies the
   reaction. Errors carry the available alternatives so a typo or
   stale field name surfaces before the test author hand-edits the
   case.

The write side is a single one-shot poke at scheduler-load time;
there is no `Op` variant for runtime writes. Ergonomic mid-scenario
state mutation is reserved for cases where the scheduler itself
exports a writable interface (sysfs, debugfs, BPF map command
interface) and the test invokes that interface from a workload
process.