ktstr 0.3.2

Test harness for Linux process schedulers

# Monitor

The monitor observes scheduler state from the host side by reading
guest VM memory directly. It does not instrument the guest kernel or
the scheduler under test.

## What it reads

The monitor resolves kernel structure offsets via BTF (BPF Type Format)
from the guest kernel. It reads per-CPU runqueue structures to extract:

- `nr_running` -- number of runnable tasks on each CPU
- `scx_nr_running` -- tasks managed by the sched_ext scheduler
- `rq_clock` -- runqueue clock value
- `local_dsq_depth` -- scx local dispatch queue depth
- `scx_flags` -- sched_ext flags for each CPU
- scx event counters (fallback, keep-last, offline dispatch,
  skip-exiting, skip-migration-disabled)

When `CONFIG_SCHEDSTATS` is enabled, the monitor also reads per-CPU
`struct rq` schedstat fields (run_delay, pcount, sched_count,
ttwu_count, etc.).

The monitor walks the `struct sched_domain` tree whenever BTF
contains `rq->sd` and `struct sched_domain` — no `CONFIG_SCHEDSTATS`
required. Domain tree walking starts at `rq->sd` (lowest level) and
follows `sd->parent` pointers up to the root. Each domain level
provides topology metadata (level, name, flags, span_weight) and
runtime fields (balance_interval, nr_balance_failed, newidle_call,
newidle_success, newidle_ratio, max_newidle_lb_cost). When
`CONFIG_SCHEDSTATS` is also enabled, each
domain additionally provides load balancing stats: `lb_count`,
`lb_failed`, `lb_balanced`, `alb_pushed`, `ttwu_wake_remote`, and
other counters indexed by idle type (`CPU_NOT_IDLE`, `CPU_IDLE`,
`CPU_NEWLY_IDLE`).
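
The parent-pointer walk can be sketched as follows. `DomainNode` is a hypothetical in-memory stand-in for the guest-memory reads; field names mirror `struct sched_domain`, but this is not the monitor's actual reader:

```rust
// Hypothetical model of the sched_domain parent chain (not the real
// guest-memory reader); the monitor resolves these fields via BTF.
struct DomainNode {
    level: u32,
    span_weight: u32,
    parent: Option<Box<DomainNode>>,
}

/// Collect (level, span_weight) per domain, bottom-up: start at the
/// lowest level (`rq->sd`) and follow `parent` links to the root.
fn walk_domains(mut sd: Option<&DomainNode>) -> Vec<(u32, u32)> {
    let mut levels = Vec::new();
    while let Some(d) = sd {
        levels.push((d.level, d.span_weight));
        sd = d.parent.as_deref();
    }
    levels
}
```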

## Sampling

The monitor takes periodic snapshots (`MonitorSample`) of all per-CPU
state. Each sample captures a point-in-time view of every CPU.

`MonitorSummary` aggregates samples into peak values (max imbalance
ratio, max DSQ depth, stall detection), per-sample averages
(imbalance ratio, nr_running per CPU, DSQ depth per CPU), and event
counter deltas. Averages are computed over valid samples only
(excluding uninitialized guest memory).
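
The valid-samples-only averaging can be illustrated with a small sketch; `valid_average` is a hypothetical helper, not the crate's summary code:

```rust
/// Per-sample average over valid samples only; samples flagged invalid
/// (e.g. uninitialized guest memory) are skipped entirely rather than
/// skewing the mean. Returns None when no sample is valid.
fn valid_average(samples: &[(f64, bool)]) -> Option<f64> {
    let valid: Vec<f64> = samples
        .iter()
        .filter(|&&(_, ok)| ok)
        .map(|&(v, _)| v)
        .collect();
    if valid.is_empty() {
        None
    } else {
        Some(valid.iter().sum::<f64>() / valid.len() as f64)
    }
}
```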

## Threshold evaluation

`MonitorThresholds` defines pass/fail conditions:

```rust,ignore
pub struct MonitorThresholds {
    pub max_imbalance_ratio: f64,
    pub max_local_dsq_depth: u32,
    pub fail_on_stall: bool,
    pub sustained_samples: usize,
    pub max_fallback_rate: f64,
    pub max_keep_last_rate: f64,
}
```

A violation must persist for `sustained_samples` consecutive samples
before triggering a failure. This filters transient spikes from cpuset
transitions and cgroup creation/destruction.
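
The consecutive-sample requirement reduces to a streak counter; a minimal sketch (not the crate's evaluator):

```rust
/// True once a violation has persisted for `sustained` consecutive
/// samples; any clean sample resets the streak, so transient spikes
/// never trigger a failure.
fn sustained_violation(violations: &[bool], sustained: usize) -> bool {
    let mut streak = 0;
    for &v in violations {
        if v {
            streak += 1;
            if streak >= sustained {
                return true;
            }
        } else {
            streak = 0;
        }
    }
    false
}
```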

### Stall detection

A stall is detected when a CPU's `rq_clock` does not advance between
consecutive samples. Three exemptions prevent false positives:

- **Idle CPUs**: when `nr_running == 0` in both the current and previous
  sample, the CPU has no runnable tasks. The kernel stops the tick
  (NOHZ) on idle CPUs, so `rq_clock` legitimately does not advance.
  These CPUs are excluded from stall checks.

- **Preempted vCPUs**: when the vCPU thread's CPU time did not advance
  past the preemption threshold between samples, the host preempted the
  vCPU. These samples are excluded from stall checks.

- **Sustained window**: stall detection uses per-CPU consecutive
  counters and the `sustained_samples` threshold, matching how
  imbalance and DSQ depth checks work. A single stuck sample does
  not trigger failure -- the stall must persist for `sustained_samples`
  consecutive samples on the same CPU.
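
A single-pair stall check with the idle exemption might look like this (the preempted-vCPU exemption and the per-CPU streak counter are omitted for brevity; this is a sketch, not the monitor's code):

```rust
/// One per-CPU check between consecutive samples. A CPU counts as
/// stalled only if rq_clock did not advance AND it had runnable tasks
/// in both samples -- idle CPUs legitimately stop the tick under NOHZ.
fn sample_stalled(prev_clock: u64, cur_clock: u64, prev_nr: u32, cur_nr: u32) -> bool {
    let idle_both = prev_nr == 0 && cur_nr == 0;
    cur_clock <= prev_clock && !idle_both
}
```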

## Uninitialized memory detection

Before the guest kernel initializes per-CPU structures, monitor reads
return uninitialized data. Two layers handle this:

- **Summary computation** (`MonitorSummary::from_samples`): skips
  individual samples where any CPU's `local_dsq_depth` exceeds
  `DSQ_PLAUSIBILITY_CEILING` (10,000) via `sample_looks_valid()`.

- **Threshold evaluation** (`MonitorThresholds::evaluate`): checks all
  samples globally via `data_looks_valid()`. If all `rq_clock` values
  are identical across every CPU and sample, or any sample exceeds the
  plausibility ceiling, the entire report passes as "not yet
  initialized" and no per-threshold checks run.
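
The global check reduces to two conditions; a sketch over flattened inputs (the real function takes the full report):

```rust
const DSQ_PLAUSIBILITY_CEILING: u32 = 10_000;

/// Validity sketch: data is trusted only if the rq_clock values vary
/// somewhere (identical clocks everywhere means the guest never wrote
/// them) and every DSQ depth is below the plausibility ceiling.
fn data_looks_valid(clocks: &[u64], dsq_depths: &[u32]) -> bool {
    let clocks_vary = clocks.windows(2).any(|w| w[0] != w[1]);
    let depths_plausible = dsq_depths.iter().all(|&d| d <= DSQ_PLAUSIBILITY_CEILING);
    clocks_vary && depths_plausible
}
```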

## BPF map introspection

The monitor module also provides host-side BPF map discovery and
read/write access via `bpf_map::BpfMapAccessor`. The host reads and
writes guest BPF maps directly through the physical memory mapping
— no guest cooperation or BPF syscalls are needed.

### GuestMem

`GuestMem` wraps a host pointer to the start of guest DRAM and
provides bounds-checked volatile reads and writes for scalar types
(u8/u32/u64). Byte-slice reads (`read_bytes`) use
`copy_nonoverlapping`. It also implements x86-64 page table walks
(`translate_kva`) for both 4-level and 5-level paging, and 3-level
aarch64 walks (64KB granule).

Scalar accesses use volatile semantics because the guest kernel
modifies memory concurrently.
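
A minimal sketch of a bounds-checked volatile read, backed by a plain byte buffer rather than real guest DRAM (the actual `GuestMem` API may differ; byte-wise volatile reads are used here to sidestep alignment concerns):

```rust
/// Hypothetical stand-in for GuestMem: a base pointer to "guest DRAM"
/// plus its length, with bounds-checked volatile scalar reads.
struct GuestMemSketch {
    base: *const u8,
    len: usize,
}

impl GuestMemSketch {
    fn read_u32(&self, pa: usize) -> Option<u32> {
        let end = pa.checked_add(4)?;
        if end > self.len {
            return None; // outside guest DRAM bounds
        }
        // Volatile byte reads: the guest kernel mutates this memory
        // concurrently, so the compiler must not cache or elide them.
        let mut b = [0u8; 4];
        for (i, slot) in b.iter_mut().enumerate() {
            *slot = unsafe { self.base.add(pa + i).read_volatile() };
        }
        Some(u32::from_le_bytes(b))
    }
}
```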

### GuestKernel

`GuestKernel` builds on `GuestMem` by adding kernel symbol
resolution and address translation. It parses the vmlinux ELF
symbol table at construction and resolves paging configuration
(PAGE_OFFSET, CR3, 4-level vs 5-level) from guest memory.
Subsequent reads use cached state.

Three address translation modes are supported:

- **Text/data/bss**: `kva - __START_KERNEL_map`. For statically allocated
  kernel variables (`read_symbol_*`, `write_symbol_*`).
- **Direct mapping**: `kva - PAGE_OFFSET`. For SLAB allocations,
  per-CPU data, physically contiguous memory (`read_direct_*`).
- **Vmalloc/vmap**: Page table walk via CR3. For BPF maps, vmalloc'd
  memory, module text (`read_kva_*`, `write_kva_*`).
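
The two linear modes are plain subtraction; a sketch using illustrative x86-64 defaults (4-level paging, nokaslr — ktstr resolves the real values from the guest at runtime):

```rust
// Illustrative x86-64 defaults; not read from any real guest.
const START_KERNEL_MAP: u64 = 0xffff_ffff_8000_0000;
const PAGE_OFFSET: u64 = 0xffff_8880_0000_0000;

/// KVA -> guest physical address for the two linear translation modes.
/// Addresses in neither range (vmalloc/vmap) need a page table walk
/// and return None here.
fn translate_linear(kva: u64, dram_size: u64) -> Option<u64> {
    if kva >= START_KERNEL_MAP {
        Some(kva - START_KERNEL_MAP) // text/data/bss, assumes phys_base = 0
    } else if kva >= PAGE_OFFSET && kva - PAGE_OFFSET < dram_size {
        Some(kva - PAGE_OFFSET) // direct mapping
    } else {
        None // vmalloc range: walk page tables via CR3
    }
}
```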

### BpfMapAccessor

`BpfMapAccessor` resolves BTF offsets for BPF map kernel structures
(`struct bpf_map`, `struct bpf_array`, `struct xa_node`, `struct idr`)
and provides map discovery and value read/write. It borrows a
`GuestKernel` for address translation.

`BpfMapAccessorOwned` is a convenience wrapper that owns the
`GuestKernel` internally. Use `BpfMapAccessor::from_guest_kernel`
when you already have a `GuestKernel`; use `BpfMapAccessorOwned::new`
when you want a self-contained accessor.

Map discovery walks the kernel's `map_idr` xarray:

1. Read `map_idr` (BSS symbol, text mapping translation)
2. Walk xa_node tree (SLAB-allocated, direct mapping translation)
3. Read `struct bpf_map` fields (vmalloc'd, page table walk)

`find_map` searches by **name suffix** (e.g. `".bss"` matches
`"mitosis.bss"`). Only `BPF_MAP_TYPE_ARRAY` maps are returned.
Use `maps()` to enumerate all map types without filtering.
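
The filter can be sketched as follows; `MapEntry` is a hypothetical record type, and `BPF_MAP_TYPE_ARRAY` is 2 in the kernel UAPI:

```rust
const BPF_MAP_TYPE_ARRAY: u32 = 2; // kernel UAPI value

struct MapEntry {
    name: String,
    map_type: u32,
}

/// Discovery filter sketch: the first BPF_MAP_TYPE_ARRAY map whose
/// name ends with the requested suffix.
fn find_map<'a>(maps: &'a [MapEntry], suffix: &str) -> Option<&'a MapEntry> {
    maps.iter()
        .find(|m| m.map_type == BPF_MAP_TYPE_ARRAY && m.name.ends_with(suffix))
}
```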

Value access for `BPF_MAP_TYPE_ARRAY` maps reads/writes the inline
`bpf_array.value` flex array at the BTF-resolved offset. The value
region is vmalloc'd, so each byte access goes through the page table
walker to handle page boundaries.
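
Why per-byte translation matters: vmalloc pages are virtually contiguous but physically scattered, so a single base translation would read garbage past the first page boundary. A sketch, with `translate` standing in for the page table walker:

```rust
const PAGE_SIZE: u64 = 4096;

/// Byte-granular vmalloc read: every byte's KVA is translated
/// independently, so the read stays correct when the value region
/// spans physically non-contiguous pages.
fn read_vmalloc(
    translate: impl Fn(u64) -> Option<u64>,
    mem: &[u8],
    kva: u64,
    len: usize,
) -> Option<Vec<u8>> {
    (0..len as u64)
        .map(|i| {
            let pa = translate(kva + i)? as usize;
            mem.get(pa).copied()
        })
        .collect()
}
```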

For `BPF_MAP_TYPE_PERCPU_ARRAY` maps, `bpf_array.pptrs[key]` holds
a `__percpu` pointer (at the same union offset as `value`). Adding
`__per_cpu_offset[cpu]` yields the per-CPU KVA in the direct mapping.
`read_percpu_array` returns one `Option<Vec<u8>>` per CPU: `Some`
when the per-CPU PA falls within guest memory, `None` when it does not.
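
The per-CPU pointer arithmetic, sketched with illustrative constants (the real code resolves `__per_cpu_offset` and PAGE_OFFSET from the guest):

```rust
/// Per-CPU address sketch: the __percpu pointer plus each CPU's
/// __per_cpu_offset gives a direct-mapping KVA; subtracting PAGE_OFFSET
/// yields the PA. Out-of-DRAM results become None, mirroring
/// read_percpu_array's Option-per-CPU shape.
fn percpu_pas(pptr: u64, offsets: &[u64], page_offset: u64, dram_size: u64) -> Vec<Option<u64>> {
    offsets
        .iter()
        .map(|&off| {
            let pa = pptr.wrapping_add(off).wrapping_sub(page_offset);
            (pa < dram_size).then_some(pa)
        })
        .collect()
}
```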

### Typed field access

When a map has BTF metadata (`btf_kva != 0`), `resolve_value_layout`
reads the guest's `struct btf` and its `data` blob, parses it with
`btf_rs`, and resolves the value struct's fields. This enables
`read_field` / `write_field` with type-checked `BpfValue` variants.

### Usage example

Find a scheduler's `.bss` map and write a crash variable:

```rust,ignore
let accessor = BpfMapAccessor::from_guest_kernel(&kernel, vmlinux)?;
let bss = accessor.find_map(".bss").expect(".bss map not found");
accessor.write_value_u32(&bss, crash_offset, 1);
```

### BpfMapWrite

`BpfMapWrite` specifies a host-side write to a BPF map during VM
execution. The test runner waits for the scheduler to load (map
becomes discoverable), writes the value, then signals the guest via
SHM to start the scenario.

```rust,ignore
pub struct BpfMapWrite {
    pub map_name_suffix: &'static str,  // e.g. ".bss"
    pub offset: usize,                  // byte offset in the map value
    pub value: u32,                     // value to write
}
```

Use with `#[ktstr_test]` via the `bpf_map_write` attribute:

```rust,ignore
const BPF_CRASH: BpfMapWrite = BpfMapWrite {
    map_name_suffix: ".bss",
    offset: 42,
    value: 1,
};

#[ktstr_test(bpf_map_write = BPF_CRASH, expect_err = true)]
fn crash_test(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}
```

The map is discovered by name suffix via `BpfMapAccessor::find_map`.
Only `BPF_MAP_TYPE_ARRAY` maps are supported. The write targets a
u32 at the specified byte offset within the map's value region.

### Prerequisites

- **vmlinux**: Required for ELF symbols and BTF. Must match the guest
  kernel.
- **nokaslr**: Required on the guest command line. Text mapping
  translation assumes `phys_base = 0`.

## Probe pipeline

The probe pipeline captures function arguments and struct fields during
auto-repro. It operates inside the guest VM (not from the host), using
two BPF skeletons that share maps.

### Architecture

```text
crash stack -> extract functions -> BTF resolve -> load skeletons -> poll
                                                         |
                                    kprobe skeleton      |     fentry/fexit skeleton
                                    (kernel entry)       |     (BPF + kernel exit)
                                         |               |          |
                                         v               v          v
                                    func_meta_map  <--shared-->  probe_data
                                                         |        (entry + exit fields)
                                              trigger fires (ring buffer)
                                                         |
                                              read probe_data entries
                                                         |
                                              stitch by tptr
                                                         |
                                              format with entry→exit diffs
```

### Kprobe skeleton (`probe.bpf.c`)

Attaches to kernel functions via `attach_kprobe`. The BPF handler:
1. Gets the function IP via `bpf_get_func_ip`
2. Looks up `func_meta` from `func_meta_map` (keyed by IP)
3. Captures 6 raw args from `pt_regs`
4. Dereferences struct fields via BTF-resolved offsets
5. Reads char * string params if configured
6. Stores result in `probe_data` (keyed by `(func_ip, task_ptr)`)

The trigger fires via `tp_btf/sched_ext_exit` (inside
`scx_claim_exit()`) and sends an `EVENT_TRIGGER` via ring buffer
with the current task pointer and kernel stack.

### Fentry/fexit skeleton (`fentry_probe.bpf.c`)

Handles both BPF struct_ops callbacks and kernel function exit
capture. Loaded in batches of 4 fentry + 4 fexit programs per
skeleton instance via `set_attach_target`. Shares `probe_data` and
`func_meta_map` with the kprobe skeleton via `reuse_fd`.

A per-slot `is_kernel` rodata flag controls argument access:
- **BPF callbacks** (`is_kernel=0`): `ctx[0]` is a void pointer to
  the real callback arguments. The handler dereferences through it.
  Uses sentinel IPs (`func_idx | (1<<63)`) in `func_meta_map`.
- **Kernel functions** (`is_kernel=1`): args are directly in
  `ctx[0..5]`. Uses `bpf_get_func_ip(ctx)` for the real IP,
  matching the kprobe entry handler's key.

Fexit handlers look up the existing `probe_data` entry (written by
fentry or kprobe at function entry) and re-read struct fields into
`exit_fields`. This captures post-mutation state for paired display.
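
The entry/exit pairing reduces to a map keyed by `(func_ip, task_ptr)`; a sketch with hypothetical types, not the BPF map's real layout:

```rust
use std::collections::HashMap;

type Key = (u64, u64); // (func_ip, task_ptr)

#[derive(Clone, Debug, Default, PartialEq)]
struct ProbeEntry {
    entry_fields: Vec<u64>,
    exit_fields: Vec<u64>,
}

/// Entry side (kprobe or fentry): create or overwrite the keyed entry.
fn record_entry(map: &mut HashMap<Key, ProbeEntry>, key: Key, fields: Vec<u64>) {
    map.entry(key).or_default().entry_fields = fields;
}

/// Exit side (fexit): re-read fields into the *existing* entry so the
/// formatter can diff entry vs exit state. Unmatched exits are dropped.
fn record_exit(map: &mut HashMap<Key, ProbeEntry>, key: Key, fields: Vec<u64>) {
    if let Some(e) = map.get_mut(&key) {
        e.exit_fields = fields;
    }
}
```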

### BTF resolution

Two BTF sources:

- **vmlinux BTF** (`btf-rs`): resolves kernel struct offsets. Types in
  `STRUCT_FIELDS` (task_struct, rq, scx_dispatch_q, etc.) use curated
  field lists with chained pointer dereferences (e.g.
  `->cpus_ptr->bits[0]`). Other struct pointer params get scalar, enum,
  and cpumask pointer fields auto-discovered from vmlinux BTF.

- **Program BTF** (`libbpf-rs`): resolves BPF-local struct offsets for
  types not in vmlinux (e.g. scheduler-defined `task_ctx`).
  Auto-discovers scalar, enum, and cpumask pointer fields.

Callback signatures are resolved by:
1. `____name` inner function in program BTF (typed params)
2. `sched_ext_ops` member in vmlinux BTF (fallback)
3. Wrapper function (void *ctx, no useful params)

### Field decoding

The output formatter decodes field values based on their key name:
- `dsq_id` -> `SCX_DSQ_INVALID`, `SCX_DSQ_GLOBAL`, `SCX_DSQ_LOCAL`,
  `SCX_DSQ_BYPASS`, `SCX_DSQ_LOCAL_ON|{cpu}`, `BUILTIN({v})`, `DSQ(0x{hex})`
- `cpumask_0..3` -> coalesced `cpus_ptr 0xf(0-3)`
- `enq_flags` -> `WAKEUP|HEAD|PREEMPT`
- `exit_kind` -> `ERROR`, `ERROR_BPF`, `ERROR_STALL`, etc.
- `scx_flags` -> `QUEUED|ENABLED`
- `sticky_cpu` -> `-1` for 0xffffffff
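
Two of the decoders above, sketched (the exact output strings are illustrative, not the formatter's verbatim format):

```rust
/// sticky_cpu: 0xffffffff is the "no sticky CPU" sentinel, shown as -1.
fn decode_sticky_cpu(raw: u32) -> i64 {
    if raw == u32::MAX { -1 } else { raw as i64 }
}

/// Cpumask coalescing sketch: render the mask word as hex plus the
/// span of set CPUs, e.g. 0xf -> "0xf(0-3)".
fn render_cpumask(mask: u64) -> String {
    if mask == 0 {
        return "0x0".to_string();
    }
    let lo = mask.trailing_zeros();
    let hi = 63 - mask.leading_zeros();
    format!("{:#x}({}-{})", mask, lo, hi)
}
```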

### Event stitching

After the trigger fires, all `probe_data` entries are read, matched
to functions by IP, then filtered to a single task's scheduling
journey:

1. Read the task_struct pointer from the trigger event's
   `bpf_get_current_task()` value (`args[0]`)
2. For functions with a task_struct parameter: keep events where
   `args[param_idx] == tptr`
3. For functions without a task_struct parameter: keep events where
   `task_ptr == tptr` (matched via `bpf_get_current_task()` at
   probe time)

Events are sorted by timestamp for chronological output.
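
The filter-and-sort steps above can be sketched as one function; `Event` is a hypothetical flattened record, not the crate's actual type:

```rust
#[derive(Clone, Debug, PartialEq)]
struct Event {
    ts: u64,               // capture timestamp
    tptr: u64,             // bpf_get_current_task() at probe time
    task_arg: Option<u64>, // task_struct parameter value, if the function has one
}

/// Filter events to one task's scheduling journey, then sort
/// chronologically: match on the task_struct parameter when the
/// function has one, otherwise on the current-task pointer.
fn task_journey(mut events: Vec<Event>, tptr: u64) -> Vec<Event> {
    events.retain(|e| match e.task_arg {
        Some(arg) => arg == tptr,
        None => e.tptr == tptr,
    });
    events.sort_by_key(|e| e.ts);
    events
}
```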