ktstr 0.5.2

Test harness for Linux process schedulers
# Monitor

The monitor observes scheduler state from the host side by reading
guest VM memory directly. It does not instrument the guest kernel or
the scheduler under test.

## What it reads

The monitor resolves kernel structure offsets via BTF (BPF Type Format)
from the guest kernel. It reads per-CPU runqueue structures to extract:

- `nr_running` -- number of runnable tasks on each CPU
- `scx_nr_running` -- tasks managed by the sched_ext scheduler
- `rq_clock` -- runqueue clock value
- `local_dsq_depth` -- scx local dispatch queue depth
- `scx_flags` -- sched_ext flags for each CPU
- scx event counters (fallback, keep-last, offline dispatch,
  skip-exiting, skip-migration-disabled, reenq-immed,
  reenq-local-repeat, refill-slice-dfl, bypass-duration,
  bypass-dispatch, bypass-activate, insert-not-owned,
  sub-bypass-dispatch)

When `CONFIG_SCHEDSTATS` is enabled, the monitor also reads per-CPU
`struct rq` schedstat fields (run_delay, pcount, sched_count,
ttwu_count, etc.).
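
As an illustration of the read path, a minimal sketch (the
`read_direct_*` accessor family is described under "GuestKernel"
below; the parameter names and exact signatures here are
assumptions):

```rust,ignore
// Sketch only: sample rq.nr_running for every CPU. `runqueues` is the
// per-CPU `struct rq`; adding `__per_cpu_offset[cpu]` yields each
// CPU's instance in the direct mapping. The field offset comes from
// guest BTF. Names and signatures are illustrative.
fn sample_nr_running(
    k: &GuestKernel,
    runqueues_kva: u64,   // address of the `runqueues` per-CPU symbol
    pcpu_offsets: &[u64], // __per_cpu_offset[], one entry per CPU
    nr_running_off: u64,  // BTF-resolved offsetof(struct rq, nr_running)
) -> Vec<u32> {
    pcpu_offsets
        .iter()
        .map(|&off| {
            let rq = runqueues_kva.wrapping_add(off);
            k.read_direct_u32(rq + nr_running_off).unwrap_or(0)
        })
        .collect()
}
```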

The monitor walks the `struct sched_domain` tree whenever BTF
contains `rq->sd` and `struct sched_domain` — no `CONFIG_SCHEDSTATS`
required. Domain tree walking starts at `rq->sd` (lowest level) and
follows `sd->parent` pointers up to the root. Each domain level
provides topology metadata (level, name, flags, span_weight),
runtime fields (balance_interval, nr_balance_failed,
max_newidle_lb_cost), and optional fields (newidle_call,
newidle_success, newidle_ratio — added in 7.0, backported to
6.18.5+ and 6.12.65+; absent on 6.16-6.18.4). When
`CONFIG_SCHEDSTATS` is also enabled, each
domain additionally provides load balancing stats: `lb_count`,
`lb_failed`, `lb_balanced`, `alb_pushed`, `ttwu_wake_remote`, and
other counters indexed by idle type (`CPU_NOT_IDLE`, `CPU_IDLE`,
`CPU_NEWLY_IDLE`).
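
A sketch of the walk, assuming hypothetical offset and record types
(the real field set is listed above):

```rust,ignore
// Follow rq->sd, then sd->parent, collecting one record per level.
// Offsets are BTF-resolved; the structures live in the direct
// mapping. All type and field names here are illustrative.
struct SdOffsets { rq_sd: u64, parent: u64, level: u64, span_weight: u64 }
struct DomainLevel { level: u32, span_weight: u32 }

fn walk_domains(k: &GuestKernel, rq_kva: u64, off: &SdOffsets) -> Vec<DomainLevel> {
    let mut out = Vec::new();
    let mut sd = k.read_direct_u64(rq_kva + off.rq_sd).unwrap_or(0); // lowest level
    while sd != 0 {
        out.push(DomainLevel {
            level: k.read_direct_u32(sd + off.level).unwrap_or(0),
            span_weight: k.read_direct_u32(sd + off.span_weight).unwrap_or(0),
        });
        sd = k.read_direct_u64(sd + off.parent).unwrap_or(0); // root has NULL parent
    }
    out
}
```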

## Sampling

The monitor takes periodic snapshots (`MonitorSample`) of all per-CPU
state. Each sample captures a point-in-time view of every CPU.

`MonitorSummary` aggregates samples into peak values (max imbalance
ratio, max DSQ depth, stall detection), per-sample averages
(imbalance ratio, nr_running per CPU, DSQ depth per CPU), and event
counter deltas. Averages are computed over valid samples only
(excluding uninitialized guest memory).

## Threshold evaluation

`MonitorThresholds` defines pass/fail conditions:

```rust,ignore
pub struct MonitorThresholds {
    pub max_imbalance_ratio: f64,
    pub max_local_dsq_depth: u32,
    pub fail_on_stall: bool,
    pub sustained_samples: usize,
    pub max_fallback_rate: f64,
    pub max_keep_last_rate: f64,
}
```

A violation must persist for `sustained_samples` consecutive samples
before triggering a failure. This filters transient spikes from cpuset
transitions and cgroup creation/destruction.
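
A minimal sketch of that filter (the counter type is illustrative):

```rust,ignore
// One counter per check (and per CPU where applicable): it increments
// on each violating sample, resets on a clean one, and only a run of
// `sustained_samples` consecutive violations reports a failure.
struct SustainedCounter { hits: usize }

impl SustainedCounter {
    fn observe(&mut self, violating: bool, sustained_samples: usize) -> bool {
        if violating { self.hits += 1 } else { self.hits = 0 }
        self.hits >= sustained_samples
    }
}
```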

### Stall detection

A stall is detected when a CPU's `rq_clock` does not advance between
consecutive samples. Three exemptions prevent false positives:

- **Idle CPUs**: when `nr_running == 0` in both the current and previous
  sample, the CPU has no runnable tasks. The kernel stops the tick
  (NOHZ) on idle CPUs, so `rq_clock` legitimately does not advance.
  These CPUs are excluded from stall checks.

- **Preempted vCPUs**: when the vCPU thread's CPU time did not advance
  past the preemption threshold between samples, the host preempted the
  vCPU. These samples are excluded from stall checks.

- **Sustained window**: stall detection uses per-CPU consecutive
  counters and the `sustained_samples` threshold, matching how
  imbalance and DSQ depth checks work. A single stuck sample does
  not trigger failure -- the stall must persist for `sustained_samples`
  consecutive samples on the same CPU.
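
Putting the exemptions together, a sketch of the per-CPU check
(struct fields and the threshold constant are assumptions):

```rust,ignore
struct CpuSample { nr_running: u32, rq_clock: u64, vcpu_cputime_ns: u64 }

const PREEMPT_THRESHOLD_NS: u64 = 1_000_000; // illustrative stand-in

// Returns true once a genuine stall has persisted for
// `sustained_samples` consecutive samples on this CPU.
fn check_stall(prev: &CpuSample, cur: &CpuSample,
               consecutive: &mut usize, sustained_samples: usize) -> bool {
    let idle = prev.nr_running == 0 && cur.nr_running == 0; // NOHZ: no tick
    let preempted =
        cur.vcpu_cputime_ns - prev.vcpu_cputime_ns < PREEMPT_THRESHOLD_NS;
    let stuck = cur.rq_clock == prev.rq_clock;
    if stuck && !idle && !preempted {
        *consecutive += 1; // candidate stall on this CPU
    } else {
        *consecutive = 0; // clock advanced or sample exempt
    }
    *consecutive >= sustained_samples
}
```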

## Uninitialized memory detection

Before the guest kernel initializes per-CPU structures, monitor reads
return uninitialized data. Two layers handle this:

- **Summary computation** (`MonitorSummary::from_samples`): skips
  individual samples where any CPU's `local_dsq_depth` exceeds
  `DSQ_PLAUSIBILITY_CEILING` (10,000) via `sample_looks_valid()`.

- **Threshold evaluation** (`MonitorThresholds::evaluate`): checks all
  samples globally for plausibility. If all `rq_clock` values are
  identical across every CPU and sample, or any sample exceeds the
  plausibility ceiling, the entire report is treated as "not yet
  initialized" and passes; no per-threshold checks run.

## BPF map introspection

The monitor module also provides host-side BPF map discovery and
read/write access via `bpf_map::BpfMapAccessor`. The host reads and
writes guest BPF maps directly through the physical memory mapping
— no guest cooperation or BPF syscalls are needed.

### GuestMem

`GuestMem` wraps a host pointer to the start of guest DRAM and
provides bounds-checked volatile reads and writes for scalar types
(u8/u32/u64). Byte-slice reads (`read_bytes`) use
`copy_nonoverlapping`. It also implements x86-64 page table walks
(`translate_kva`) for both 4-level and 5-level paging, and
granule-agnostic aarch64 walks (4 KB / 16 KB / 64 KB; level count
derived from TCR_EL1's TG1 + T1SZ fields).

Scalar accesses use volatile semantics because the guest kernel
modifies memory concurrently.
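
A minimal sketch of the scalar read path (the struct shape is an
assumption):

```rust,ignore
struct GuestMem {
    base: *const u8, // host pointer to the start of guest DRAM
    size: u64,       // guest DRAM size in bytes
}

impl GuestMem {
    // Bounds-checked volatile read of a guest-physical u32; callers
    // pass naturally aligned addresses.
    fn read_u32(&self, pa: u64) -> Option<u32> {
        if pa.checked_add(4)? > self.size {
            return None; // outside guest DRAM
        }
        // Volatile: the guest mutates this memory concurrently, so
        // the compiler must not cache or reorder the load.
        Some(unsafe {
            std::ptr::read_volatile(self.base.add(pa as usize) as *const u32)
        })
    }
}
```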

### GuestKernel

`GuestKernel` builds on `GuestMem` by adding kernel symbol
resolution and address translation. It parses the vmlinux ELF
symbol table at construction and resolves paging configuration
(PAGE_OFFSET, CR3, 4-level vs 5-level) from guest memory.
Subsequent reads use cached state.

Three address translation modes are supported:

- **Text/data/bss**: `kva - __START_KERNEL_map`. For statically-linked
  kernel variables (`read_symbol_*`, `write_symbol_*`).
- **Direct mapping**: `kva - PAGE_OFFSET`. For SLAB allocations,
  per-CPU data, physically contiguous memory (`read_direct_*`).
- **Vmalloc/vmap**: Page table walk via CR3. For BPF maps, vmalloc'd
  memory, module text (`read_kva_*`, `write_kva_*`).
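
A sketch of the three translations (the fields on `GuestKernel` and
the `translate_kva` signature are assumptions):

```rust,ignore
impl GuestKernel {
    // read_symbol_* / write_symbol_*: kernel image mapping.
    fn text_to_pa(&self, kva: u64) -> u64 {
        kva - self.start_kernel_map + self.phys_base // phys_base covers KASLR
    }

    // read_direct_*: linear direct mapping of guest DRAM.
    fn direct_to_pa(&self, kva: u64) -> u64 {
        kva - self.page_offset
    }

    // read_kva_* / write_kva_*: full page table walk (4- or 5-level).
    fn kva_to_pa(&self, kva: u64) -> Option<u64> {
        self.mem.translate_kva(self.cr3, kva)
    }
}
```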

### BpfMapAccessor

`BpfMapAccessor` resolves BTF offsets for BPF map kernel structures
(`struct bpf_map`, `struct bpf_array`, `struct xa_node`, `struct idr`)
and provides map discovery and value read/write. It borrows a
`GuestKernel` for address translation.

`BpfMapAccessorOwned` is a convenience wrapper that owns the
`GuestKernel` internally. Use `BpfMapAccessor::from_guest_kernel`
when you already have a `GuestKernel`; use `BpfMapAccessorOwned::new`
when you want a self-contained accessor.

Map discovery walks the kernel's `map_idr` xarray:

1. Read `map_idr` (BSS symbol, text mapping translation)
2. Walk xa_node tree (SLAB-allocated, direct mapping translation)
3. Read `struct bpf_map` fields. The allocation may be kmalloc'd or
   vmalloc'd depending on size and flags, so the translation uses
   `translate_any_kva` which handles both paths rather than assuming
   either.

`find_map` searches by **name suffix** (e.g. `".bss"` matches
`"mitosis.bss"`). Only `BPF_MAP_TYPE_ARRAY` maps are returned.
Use `maps()` to enumerate all map types without filtering.
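
A sketch of the suffix match and type filter (the `MapInfo` shape is
an assumption):

```rust,ignore
const BPF_MAP_TYPE_ARRAY: u32 = 2; // from the kernel UAPI

struct MapInfo { name: String, map_type: u32, kva: u64 }

// ".bss" matches "mitosis.bss"; non-array maps are filtered out here,
// while maps() skips the filter.
fn find_map<'a>(maps: &'a [MapInfo], suffix: &str) -> Option<&'a MapInfo> {
    maps.iter()
        .find(|m| m.map_type == BPF_MAP_TYPE_ARRAY && m.name.ends_with(suffix))
}
```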

Value access for `BPF_MAP_TYPE_ARRAY` maps reads/writes the inline
`bpf_array.value` flex array at the BTF-resolved offset. The value
region is vmalloc'd, so each byte access goes through the page table
walker to handle page boundaries.

For `BPF_MAP_TYPE_PERCPU_ARRAY` maps, `bpf_array.pptrs[key]` holds
a `__percpu` pointer (at the same union offset as `value`). Adding
`__per_cpu_offset[cpu]` yields the per-CPU KVA in the direct mapping.
`read_percpu_array` returns one `Option<Vec<u8>>` per CPU: `Some`
when the per-CPU PA falls within guest memory, `None` when it does not.
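
A sketch of the per-CPU fan-out (helper names and signatures are
assumptions; `read_direct_bytes` is taken to return a `Result`):

```rust,ignore
// pptrs[key] holds a __percpu pointer; adding __per_cpu_offset[cpu]
// yields that CPU's copy in the direct mapping. A PA outside guest
// DRAM yields None for that CPU.
fn read_percpu_array(
    k: &GuestKernel,
    pptr: u64,            // value of bpf_array.pptrs[key]
    pcpu_offsets: &[u64], // __per_cpu_offset[], one entry per CPU
    value_size: usize,
) -> Vec<Option<Vec<u8>>> {
    pcpu_offsets
        .iter()
        .map(|&off| k.read_direct_bytes(pptr.wrapping_add(off), value_size).ok())
        .collect()
}
```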

### Typed field access

When a map has BTF metadata (`btf_kva != 0`), `resolve_value_layout`
reads the guest's `struct btf` and its `data` blob, parses it with
`btf_rs`, and resolves the value struct's fields. This enables
`read_field` / `write_field` with type-checked `BpfValue` variants.

### Usage example

Find a scheduler's `.bss` map and write a crash variable:

```rust,ignore
let offsets = BpfMapOffsets::from_vmlinux(vmlinux)?;
let accessor = BpfMapAccessor::from_guest_kernel(&kernel, &offsets)?;
let bss = accessor.find_map(".bss").expect(".bss map not found");
accessor.write_value_u32(&bss, crash_offset, 1);
```

### BpfMapWrite

`BpfMapWrite` specifies a host-side write to a BPF map during VM
execution. The test runner waits for the scheduler to load (map
becomes discoverable), writes the value, then signals the guest via
SHM to start the scenario.

```rust,ignore
pub struct BpfMapWrite {
    pub map_name_suffix: &'static str,  // e.g. ".bss"
    pub offset: usize,                  // byte offset in the map value
    pub value: u32,                     // value to write
}
```

Use with `#[ktstr_test]` via the `bpf_map_write` attribute:

```rust,ignore
const BPF_CRASH: BpfMapWrite = BpfMapWrite {
    map_name_suffix: ".bss",
    offset: 42,
    value: 1,
};

#[ktstr_test(bpf_map_write = BPF_CRASH, expect_err = true)]
fn crash_test(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}
```

The map is discovered by name suffix via `BpfMapAccessor::find_map`.
Only `BPF_MAP_TYPE_ARRAY` maps are supported. The write targets a
u32 at the specified byte offset within the map's value region.

### Prerequisites

- **vmlinux**: Required for ELF symbols and BTF. Must match the guest
  kernel. Symbols include `phys_base` so the runtime KASLR offset can
  be resolved via a page-table walk through the BSP's CR3, breaking
  the chicken-and-egg dependency between text-symbol PA translation
  and KASLR.

### Cast analysis

BPF maps frequently store kernel pointers (`task_struct *`,
`cgroup *`, …) and arena pointers in `u64` fields because BTF cannot
express a pointer to a per-allocation type. Without intervention the
renderer treats them as integers and the failure dump shows raw
0xffff…ffff values with no further chase.

The cast analyzer (`monitor::cast_analysis::analyze_casts`) closes
that gap. The freeze coordinator runs it once per scheduler load,
before any periodic capture or on-demand snapshot would consume
its output:

1. The host loads the scheduler binary and locates each `.bpf.o`
   ELF in the build artifacts.
2. Each program section is decoded through
   `cast_analysis::BpfInsn::from_le_bytes` into a flat `&[BpfInsn]`
   slab; relocations against `.bss` / `.data` / `.rodata` annotate
   the corresponding `BPF_LD_IMM64` PCs with their datasec target.
3. `analyze_casts` walks the slab forward, tracking register and
   stack-slot state for each instruction. Two detection paths feed
   the output: the arena pointer path (LDX through a previously
   loaded `u64` field) and the kernel kptr path (STX of a typed
   pointer register into a `u64` field). Function-entry seeding
   from `bpf_func_info` reseeds R1..R5 from the BTF FuncProto so
   typed parameters propagate correctly across subprogram joins.
4. The result is a `CastMap` (`BTreeMap<(source_struct_btf_id,
   field_byte_offset), CastHit>`) cached on the per-VM
   `KtstrVm.cast_map` (a `LazyCastMap` that runs the analyzer on
   first dump and caches the result process-wide by scheduler
   binary content hash). The freeze coordinator threads the cached
   `CastMap` through `DumpContext::cast_map` into every per-map
   render so the renderer can consult it at every dump site.
5. `render_cast_pointer` in `monitor::btf_render` consumes
   `CastHit` via `MemReader::cast_lookup`. When a `u64` field at a
   recorded `(struct, offset)` is rendered, the renderer chases
   the pointer through the address-space-appropriate reader (arena
   vs slab/vmalloc) and tags the result with a `cast_annotation`
   of `"cast→arena"` or `"cast→kernel"` (plus a `(sdt_alloc)`
   suffix when the bridge described below fired). Failure dumps
   show the annotation alongside the resolved struct fields, so
   cast-recovered pointers are visually distinct from BTF-typed
   ones.
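
The key shape comes from the list above; a sketch of the map and the
renderer-side lookup (the `CastHit` body is an assumption):

```rust,ignore
use std::collections::BTreeMap;

// Which address space the recovered pointer belongs to; the exact
// fields carried by a hit are illustrative.
enum CastHit {
    Arena { target_btf_id: u32 },
    Kernel { target_btf_id: u32 },
}

// Key: (source struct BTF id, field byte offset).
type CastMap = BTreeMap<(u32, u32), CastHit>;

// Renderer side: while rendering a u64 field of `struct_id` at
// `field_off`, ask whether the analyzer recorded a cast for it.
fn cast_lookup(map: &CastMap, struct_id: u32, field_off: u32) -> Option<&CastHit> {
    map.get(&(struct_id, field_off))
}
```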

The renderer also consults an `sdt_alloc` bridge whenever a chase
target peels to a `BTF_KIND_FWD` forward declaration (typical for
`struct sdt_data __arena *` fields whose body lives in the
sdt_alloc library's BTF rather than the scheduler's program BTF).
The dump-state pre-pass walks each live `scx_allocator` and
populates a `slot_start → ArenaSlotInfo` index — one entry
per live allocator slot, carrying `elem_size`, `header_size`, and
the resolved payload BTF type id — that
`MemReader::resolve_arena_type` (in
`dump::render_map::AccessorMemReader`) range-looks up during the
chase.

The lookup finds the slot whose
`[slot_start, slot_start + elem_size)` range contains the chased
address and routes by `offset_in_slot`: a slot-start chase
(`offset == 0`, e.g. the `data` field of `scx_task_map_val`
storing the raw `sdt_alloc()` return) returns the payload type id
with `header_skip = header_size`; a payload-start chase
(`offset == header_size`, e.g. the return of `scx_task_data(p)`
cached in `cached_taskc_raw`) returns the same payload type id
with `header_skip = 0`. The renderer reads `header_skip + btf_size`
bytes from the chased address, slices off the leading
`header_skip` bytes, and renders the payload struct.

The resulting `Ptr` carries a `sdt_alloc`-flavoured annotation:
`"sdt_alloc"` on the BTF-typed `Type::Ptr` arm, and
`"cast→arena (sdt_alloc)"` / `"cast→kernel (sdt_alloc)"` on the
cast-analyzer-driven path. The `sdt_alloc` bridge fires only when
the BTF-only resolve has already exhausted same-name siblings;
false-positive risk on that arm is bounded by the arena-window
range check (`MemReader::resolve_arena_type` returns `None` for
addresses outside every known allocator slot).
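
A sketch of the range lookup and `header_skip` routing (the types are
illustrative; the routing rules follow the text above):

```rust,ignore
use std::collections::BTreeMap;

struct ArenaSlotInfo { elem_size: u64, header_size: u64, payload_type_id: u32 }

// Returns (payload BTF type id, header_skip) for a chased address,
// or None when the address sits outside every known allocator slot.
fn resolve_arena_type(
    index: &BTreeMap<u64, ArenaSlotInfo>, // slot_start -> slot info
    addr: u64,
) -> Option<(u32, u64)> {
    // Greatest slot_start <= addr, then range-check against elem_size.
    let (&start, info) = index.range(..=addr).next_back()?;
    if addr - start >= info.elem_size {
        return None; // outside the slot window: the bridge does not fire
    }
    match addr - start {
        0 => Some((info.payload_type_id, info.header_size)), // slot-start chase
        o if o == info.header_size => Some((info.payload_type_id, 0)), // payload-start
        _ => None, // interior address: no routing rule shown here
    }
}
```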

A separate cross-BTF Fwd resolution path covers the case where a
`BTF_KIND_FWD` pointee's body lives in a sibling embedded BPF
object's BTF rather than an `sdt_alloc` slot — the typical
multi-`.bpf.objs` shape where one object declares
`struct cgx_target;` (forward) and a sibling object defines
`struct cgx_target { ... }` (full body). The cast-analysis
pre-pass (`vmm::cast_analysis_load::build_fwd_index`) walks every
parsed embedded program BTF and records a
`name -> (btfs index, type_id)` entry for every complete
(`!is_fwd`) `Type::Struct` / `Type::Union`. First-write-wins on
duplicate names: when the same name appears in multiple BTFs the
index keeps the first-seen entry. Anonymous types and `Typedef`
are not indexed (no name to key on, and typedefs add no body —
the chase peels through them via `peel_modifiers_with_id` before
consulting the index).

The index is threaded through `DumpContext::cross_btf` and exposed
to the renderer via `MemReader::cross_btf_resolve_fwd`. When
`chase_arena_pointer` / `render_cast_pointer` peel a chase target
through `peel_modifiers_resolving_fwd` and the local same-BTF
sibling search came up empty, `try_cross_btf_fwd_resolve` consults
the cross-BTF index by the Fwd's name (and aggregate kind —
`struct` vs `union`); a hit returns a `CrossBtfRef { btf, type_id }`
and the chase recursion switches to the resolved sibling BTF for
the pointee render.

Cross-BTF resolution does NOT introduce a new annotation — the
body is recovered transparently and the rendered subtree carries
the cast or BTF-typed annotation it would have had if the same
struct lived in the entry BTF. Unlike the `sdt_alloc` bridge the
cross-BTF index is consulted whenever a Fwd terminal survives the
local resolve — there is no arena-window gate, since the lookup is
purely a name-keyed BTF table and a name miss simply leaves the
chase on its existing "forward declaration; body not in this BTF"
skip path.

The analyzer is deliberately conservative: branch joins reset
register and stack state, conflicts drop the offending entry, and
self-stores are rejected. False negatives fall back to raw `u64`
(the prior behavior); false positives would chase garbage and are
avoided. The analysis is unconditional — no test-author
configuration, no opt-in flag — and the freeze coordinator wires
the resulting `CastMap` through every snapshot, periodic capture,
and failure dump.
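
Returning to the cross-BTF path, a sketch of the index build (the
BTF iteration helpers are assumptions):

```rust,ignore
use std::collections::HashMap;

#[derive(PartialEq, Eq, Hash, Clone, Copy)]
enum AggKind { Struct, Union }

// name + aggregate kind -> (index into the parsed-BTF list, type id).
// First-write-wins: a duplicate name keeps the first-seen entry.
fn build_fwd_index(btfs: &[ParsedBtf]) -> HashMap<(String, AggKind), (usize, u32)> {
    let mut idx = HashMap::new();
    for (bi, btf) in btfs.iter().enumerate() {
        for (type_id, ty) in btf.iter_types() {
            // Only complete (non-Fwd) named structs/unions; anonymous
            // types and typedefs are skipped.
            if let Some((name, kind)) = named_complete_aggregate(ty) {
                idx.entry((name, kind)).or_insert((bi, type_id));
            }
        }
    }
    idx
}
```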

## Probe pipeline

The probe pipeline captures function arguments and struct fields during
auto-repro. It operates inside the guest VM (not from the host), using
two BPF skeletons that share maps.

### Architecture

```text
crash stack -> extract functions -> BTF resolve -> load skeletons -> poll
                                                         |
                                    kprobe skeleton      |     fentry/fexit skeleton
                                    (kernel entry)       |     (BPF entry + kernel exit)
                                         |               |          |
                                         v               v          v
                                    func_meta_map  <--shared-->  probe_data
                                                         |        (entry + exit fields)
                                              trigger fires (ring buffer)
                                                         |
                                              read probe_data entries
                                                         |
                                              stitch by tptr
                                                         |
                                              format with entry→exit diffs
```

### Kprobe skeleton (`probe.bpf.c`)

Attaches to kernel functions via `attach_kprobe`. The BPF handler:
1. Gets the function IP via `bpf_get_func_ip`
2. Looks up `func_meta` from `func_meta_map` (keyed by IP)
3. Captures 6 raw args from `pt_regs`
4. Dereferences struct fields via BTF-resolved offsets
5. Reads char * string params if configured
6. Stores result in `probe_data` (keyed by `(func_ip, task_ptr)`)

The trigger fires via `tp_btf/sched_ext_exit` (inside
`scx_claim_exit()`) and sends an `EVENT_TRIGGER` via ring buffer
with the current task pointer and kernel stack.

### Fentry/fexit skeleton (`fentry_probe.bpf.c`)

Handles both BPF struct_ops callbacks and kernel function exit
capture. Loaded in batches of 4 fentry + 4 fexit programs per
skeleton instance via `set_attach_target`. Shares `probe_data` and
`func_meta_map` with the kprobe skeleton via `reuse_fd`.

A per-slot `is_kernel` rodata flag controls argument access:
- **BPF callbacks** (`is_kernel=0`): `ctx[0]` is a void pointer to
  the real callback arguments. The handler dereferences through it.
  Uses sentinel IPs (`func_idx | (1<<63)`) in `func_meta_map`.
- **Kernel functions** (`is_kernel=1`): args are directly in
  `ctx[0..5]`. Uses `bpf_get_func_ip(ctx)` for the real IP,
  matching the kprobe entry handler's key.
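
A sketch of the sentinel keying (x86-64 reasoning; the helper name is
illustrative):

```rust,ignore
// func_idx | (1 << 63) lands in the x86-64 non-canonical hole, so a
// sentinel can never collide with the real kernel text IPs that key
// kprobe and is_kernel=1 fentry entries in func_meta_map.
fn sentinel_ip(func_idx: u64) -> u64 {
    func_idx | (1u64 << 63)
}
```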

Fexit handlers look up the existing `probe_data` entry (written by
fentry or kprobe at function entry) and re-read struct fields into
`exit_fields`. This captures post-mutation state for paired display.

### BTF resolution

Two BTF sources:

- **vmlinux BTF** (`btf-rs`): resolves kernel struct offsets. Types in
  `STRUCT_FIELDS` (task_struct, rq, scx_dispatch_q, etc.) use curated
  field lists with chained pointer dereferences (e.g.
  `->cpus_ptr->bits[0]`). Other struct pointer params get scalar, enum,
  and cpumask pointer fields auto-discovered from vmlinux BTF.

- **Program BTF** (`libbpf-rs`): resolves BPF-local struct offsets for
  types not in vmlinux (e.g. scheduler-defined `task_ctx`).
  Auto-discovers scalar, enum, and cpumask pointer fields.

Callback signatures are resolved by:
1. `____name` inner function in program BTF (typed params)
2. `sched_ext_ops` member in vmlinux BTF (fallback)
3. Wrapper function (void *ctx, no useful params)

### Field decoding

The output formatter decodes field values based on their key name:
- `dsq_id` -> `SCX_DSQ_INVALID`, `SCX_DSQ_GLOBAL`, `SCX_DSQ_LOCAL`,
  `SCX_DSQ_BYPASS`, `SCX_DSQ_LOCAL_ON|{cpu}`, `BUILTIN({v})`,
  `DSQ(0x{hex})`
- `cpumask_0..3` -> coalesced into one `cpus_ptr` field rendered as
  `0x{hex}({cpu-list})` — the masked hex of the cpumask words
  (low-order word first; multi-word masks join with `_` between
  64-bit chunks) followed by the run-length-collapsed CPU range
  list (e.g. `0xf(0-3)`, `0x1_00000000000000ff(0-7,64)`)
- `enq_flags` -> `WAKEUP|HEAD|PREEMPT`
- `exit_kind` -> `ERROR`, `ERROR_BPF`, `ERROR_STALL`, etc.
- `scx_flags` -> `QUEUED|ENABLED`
- `sticky_cpu` -> `-1` for 0xffffffff

### Event stitching

After the trigger fires, all `probe_data` entries are read, matched
to functions by IP, then filtered to a single task's scheduling
journey:

1. Read the task_struct pointer from the trigger event's
   `bpf_get_current_task()` value (`args[0]`)
2. For functions with a task_struct parameter: keep events where
   `args[param_idx] == tptr`
3. For functions without a task_struct parameter: keep events where
   `task_ptr == tptr` (matched via `bpf_get_current_task()` at
   probe time)

Events are sorted by timestamp for chronological output.
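
A sketch of the filter and sort (the event shape is an assumption):

```rust,ignore
struct ProbeEvent {
    timestamp: u64,
    args: [u64; 6],                // raw entry args
    task_ptr: u64,                 // bpf_get_current_task() at probe time
    task_param_idx: Option<usize>, // which arg is a task_struct *, if any
}

// Keep only events that belong to the triggering task, then order
// them chronologically.
fn stitch(mut events: Vec<ProbeEvent>, tptr: u64) -> Vec<ProbeEvent> {
    events.retain(|e| match e.task_param_idx {
        Some(i) => e.args[i] == tptr, // function takes a task_struct *
        None => e.task_ptr == tptr,   // match by current task instead
    });
    events.sort_by_key(|e| e.timestamp);
    events
}
```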