# Monitor
The monitor observes scheduler state from the host side by reading
guest VM memory directly. It does not instrument the guest kernel or
the scheduler under test.
## What it reads
The monitor resolves kernel structure offsets via BTF (BPF Type Format)
from the guest kernel. It reads per-CPU runqueue structures to extract:
- `nr_running` -- number of runnable tasks on each CPU
- `scx_nr_running` -- tasks managed by the sched_ext scheduler
- `rq_clock` -- runqueue clock value
- `local_dsq_depth` -- scx local dispatch queue depth
- `scx_flags` -- sched_ext flags for each CPU
- scx event counters (fallback, keep-last, offline dispatch,
skip-exiting, skip-migration-disabled, reenq-immed,
reenq-local-repeat, refill-slice-dfl, bypass-duration,
bypass-dispatch, bypass-activate, insert-not-owned,
sub-bypass-dispatch)
When `CONFIG_SCHEDSTATS` is enabled, the monitor also reads per-CPU
`struct rq` schedstat fields (run_delay, pcount, sched_count,
ttwu_count, etc.).
The monitor walks the `struct sched_domain` tree whenever the guest
BTF describes `rq->sd` and `struct sched_domain` — no `CONFIG_SCHEDSTATS`
required. Domain tree walking starts at `rq->sd` (lowest level) and
follows `sd->parent` pointers up to the root. Each domain level
provides topology metadata (level, name, flags, span_weight) and
runtime fields (balance_interval, nr_balance_failed,
max_newidle_lb_cost) and optional fields (newidle_call,
newidle_success, newidle_ratio — added in 7.0, backported to
6.18.5+ and 6.12.65+; absent on 6.16-6.18.4). When
`CONFIG_SCHEDSTATS` is also enabled, each
domain additionally provides load balancing stats: `lb_count`,
`lb_failed`, `lb_balanced`, `alb_pushed`, `ttwu_wake_remote`, and
other counters indexed by idle type (`CPU_NOT_IDLE`, `CPU_IDLE`,
`CPU_NEWLY_IDLE`).
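The walk itself is a short pointer chase. A minimal sketch, assuming
a hypothetical `DomainLevel` struct and `OFF_*` placeholders for the
BTF-resolved offsets (the `read_direct_*` helpers are described under
GuestKernel below):
```rust,ignore
fn walk_domains(kernel: &GuestKernel, rq_kva: u64) -> Result<Vec<DomainLevel>> {
    let mut levels = Vec::new();
    // Start at rq->sd, the lowest domain level for this CPU.
    let mut sd = kernel.read_direct_u64(rq_kva + OFF_RQ_SD)?;
    while sd != 0 {
        levels.push(DomainLevel {
            level: kernel.read_direct_u32(sd + OFF_SD_LEVEL)?,
            span_weight: kernel.read_direct_u32(sd + OFF_SD_SPAN_WEIGHT)?,
            balance_interval: kernel.read_direct_u64(sd + OFF_SD_BALANCE_INTERVAL)?,
            // ... name, flags, nr_balance_failed, max_newidle_lb_cost
        });
        // Follow sd->parent up toward the root; NULL terminates.
        sd = kernel.read_direct_u64(sd + OFF_SD_PARENT)?;
    }
    Ok(levels)
}
```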
## Sampling
The monitor takes periodic snapshots (`MonitorSample`) of all per-CPU
state. Each sample captures a point-in-time view of every CPU.
`MonitorSummary` aggregates samples into peak values (max imbalance
ratio, max DSQ depth, stall detection), per-sample averages
(imbalance ratio, nr_running per CPU, DSQ depth per CPU), and event
counter deltas. Averages are computed over valid samples only
(excluding uninitialized guest memory).
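A minimal sketch of the aggregation shape, with illustrative field
names (`sample_looks_valid` is covered under uninitialized memory
detection below):
```rust,ignore
let valid: Vec<&MonitorSample> =
    samples.iter().filter(|s| sample_looks_valid(s)).collect();
// Peaks scan every CPU of every valid sample.
let max_dsq_depth = valid.iter()
    .flat_map(|s| s.cpus.iter())
    .map(|c| c.local_dsq_depth)
    .max()
    .unwrap_or(0);
// Averages divide by valid samples only, never the raw sample count.
let cpu_count = valid.first().map_or(1, |s| s.cpus.len());
let avg_nr_running = valid.iter()
    .flat_map(|s| s.cpus.iter())
    .map(|c| c.nr_running as f64)
    .sum::<f64>() / (valid.len().max(1) * cpu_count) as f64;
```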
## Threshold evaluation
`MonitorThresholds` defines pass/fail conditions:
```rust,ignore
pub struct MonitorThresholds {
    pub max_imbalance_ratio: f64,
    pub max_local_dsq_depth: u32,
    pub fail_on_stall: bool,
    pub sustained_samples: usize,
    pub max_fallback_rate: f64,
    pub max_keep_last_rate: f64,
}
```
A violation must persist for `sustained_samples` consecutive samples
before triggering a failure. This filters transient spikes from cpuset
transitions and cgroup creation/destruction.
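Values are test-specific; an illustrative instantiation:
```rust,ignore
let thresholds = MonitorThresholds {
    max_imbalance_ratio: 2.0,  // illustrative: worst CPU at 2x the mean
    max_local_dsq_depth: 64,
    fail_on_stall: true,
    sustained_samples: 3,      // violations under 3 samples are ignored
    max_fallback_rate: 0.01,   // illustrative rate values
    max_keep_last_rate: 0.05,
};
```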
### Stall detection
A stall is detected when a CPU's `rq_clock` does not advance between
consecutive samples. Three exemptions prevent false positives:
- **Idle CPUs**: when `nr_running == 0` in both the current and previous
sample, the CPU has no runnable tasks. The kernel stops the tick
(NOHZ) on idle CPUs, so `rq_clock` legitimately does not advance.
These CPUs are excluded from stall checks.
- **Preempted vCPUs**: when the vCPU thread's CPU time did not advance
past the preemption threshold between samples, the host preempted the
vCPU. These samples are excluded from stall checks.
- **Sustained window**: stall detection uses per-CPU consecutive
counters and the `sustained_samples` threshold, matching how
imbalance and DSQ depth checks work. A single stuck sample does
not trigger failure -- the stall must persist for `sustained_samples`
consecutive samples on the same CPU.
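A minimal sketch of the per-CPU check between two consecutive samples
(field and counter names illustrative):
```rust,ignore
// Exemption 1: NOHZ stops the tick on idle CPUs, so rq_clock may stop.
let idle = prev.nr_running == 0 && cur.nr_running == 0;
// Exemption 2: the host preempted the vCPU between samples.
let preempted = vcpu_cpu_time_delta < PREEMPTION_THRESHOLD;
if !idle && !preempted && cur.rq_clock == prev.rq_clock {
    consecutive_stuck[cpu] += 1;
} else {
    consecutive_stuck[cpu] = 0;
}
// Exemption 3: a single stuck sample is not a failure.
let stalled = consecutive_stuck[cpu] >= thresholds.sustained_samples;
```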
## Uninitialized memory detection
Before the guest kernel initializes its per-CPU structures, the monitor's reads
return uninitialized data. Two layers handle this:
- **Summary computation** (`MonitorSummary::from_samples`): skips
individual samples where any CPU's `local_dsq_depth` exceeds
`DSQ_PLAUSIBILITY_CEILING` (10,000) via `sample_looks_valid()`.
- **Threshold evaluation** (`MonitorThresholds::evaluate`): checks all
samples globally for plausibility. If all `rq_clock` values are
identical across every CPU and sample, or any sample exceeds the
plausibility ceiling, the entire report passes as "not yet
initialized" — no per-threshold checks run.
## BPF map introspection
The monitor module also provides host-side BPF map discovery and
read/write access via `bpf_map::BpfMapAccessor`. The host reads and
writes guest BPF maps directly through the physical memory mapping
— no guest cooperation or BPF syscalls are needed.
### GuestMem
`GuestMem` wraps a host pointer to the start of guest DRAM and
provides bounds-checked volatile reads and writes for scalar types
(u8/u32/u64). Byte-slice reads (`read_bytes`) use
`copy_nonoverlapping`. It also implements x86-64 page table walks
(`translate_kva`) for both 4-level and 5-level paging, and
granule-agnostic aarch64 walks (4 KB / 16 KB / 64 KB; level count
derived from TCR_EL1's TG1 + T1SZ fields).
Scalar accesses use volatile semantics because the guest kernel
modifies memory concurrently.
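A sketch of the scalar read path, assuming hypothetical `host_base`
and `dram_len` fields (the real bounds and error handling may differ):
```rust,ignore
impl GuestMem {
    pub fn read_u64(&self, guest_pa: u64) -> Option<u64> {
        let end = guest_pa.checked_add(8)?;
        if end > self.dram_len {
            return None; // outside guest DRAM
        }
        // Volatile: the guest kernel mutates this memory concurrently,
        // so the compiler must not cache or elide the access.
        let ptr = unsafe { self.host_base.add(guest_pa as usize) } as *const u64;
        Some(unsafe { ptr.read_volatile() })
    }
}
```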
### GuestKernel
`GuestKernel` builds on `GuestMem` by adding kernel symbol
resolution and address translation. It parses the vmlinux ELF
symbol table at construction and resolves paging configuration
(PAGE_OFFSET, CR3, 4-level vs 5-level) from guest memory.
Subsequent reads use cached state.
Three address translation modes are supported:
- **Text/data/bss**: `kva - __START_KERNEL_map`. For statically
allocated kernel variables (`read_symbol_*`, `write_symbol_*`).
- **Direct mapping**: `kva - PAGE_OFFSET`. For SLAB allocations,
per-CPU data, physically contiguous memory (`read_direct_*`).
- **Vmalloc/vmap**: Page table walk via CR3. For BPF maps, vmalloc'd
memory, module text (`read_kva_*`, `write_kva_*`).
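As one-liners (constants resolved at construction; adding `phys_base`
for KASLR on the text path is an assumption here, per Prerequisites
below):
```rust,ignore
let pa_text   = kva - START_KERNEL_MAP + phys_base; // read_symbol_* / write_symbol_*
let pa_direct = kva - page_offset;                  // read_direct_*
let pa_vmap   = kernel.translate_kva(kva)?;         // read_kva_* / write_kva_* (CR3 walk)
```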
### BpfMapAccessor
`BpfMapAccessor` resolves BTF offsets for BPF map kernel structures
(`struct bpf_map`, `struct bpf_array`, `struct xa_node`, `struct idr`)
and provides map discovery and value read/write. It borrows a
`GuestKernel` for address translation.
`BpfMapAccessorOwned` is a convenience wrapper that owns the
`GuestKernel` internally. Use `BpfMapAccessor::from_guest_kernel`
when you already have a `GuestKernel`; use `BpfMapAccessorOwned::new`
when you want a self-contained accessor.
Map discovery walks the kernel's `map_idr` xarray:
1. Read `map_idr` (BSS symbol, text mapping translation)
2. Walk xa_node tree (SLAB-allocated, direct mapping translation)
3. Read `struct bpf_map` fields. The allocation may be kmalloc'd or
vmalloc'd depending on size and flags, so the translation uses
`translate_any_kva` which handles both paths rather than assuming
either.
`find_map` searches by **name suffix** (e.g. `".bss"` matches
`"mitosis.bss"`). Only `BPF_MAP_TYPE_ARRAY` maps are returned.
Use `maps()` to enumerate all map types without filtering.
Value access for `BPF_MAP_TYPE_ARRAY` maps reads/writes the inline
`bpf_array.value` flex array at the BTF-resolved offset. The value
region is vmalloc'd, so each byte access goes through the page table
walker to handle page boundaries.
For `BPF_MAP_TYPE_PERCPU_ARRAY` maps, `bpf_array.pptrs[key]` holds
a `__percpu` pointer (at the same union offset as `value`). Adding
`__per_cpu_offset[cpu]` yields the per-CPU KVA in the direct mapping.
`read_percpu_array` returns one `Option<Vec<u8>>` per CPU: `Some`
when the per-CPU PA falls within guest memory, `None` when it does not.
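A sketch of the per-CPU read, with illustrative helper names
(`read_direct_bytes` follows the `read_direct_*` family above):
```rust,ignore
fn read_percpu_array(
    kernel: &GuestKernel,
    pptr: u64,              // bpf_array.pptrs[key], a __percpu pointer
    per_cpu_offset: &[u64], // __per_cpu_offset[], read via the text mapping
    value_size: usize,
) -> Vec<Option<Vec<u8>>> {
    per_cpu_offset.iter().map(|off| {
        let kva = pptr.wrapping_add(*off); // per-CPU KVA in the direct mapping
        // Ok => PA inside guest memory; Err => None for this CPU.
        kernel.read_direct_bytes(kva, value_size).ok()
    }).collect()
}
```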
### Typed field access
When a map has BTF metadata (`btf_kva != 0`), `resolve_value_layout`
reads the guest's `struct btf` and its `data` blob, parses it with
`btf_rs`, and resolves the value struct's fields. This enables
`read_field` / `write_field` with type-checked `BpfValue` variants.
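A hypothetical call shape — the field name and `BpfValue` variants
here are illustrative, not the crate's actual API:
```rust,ignore
if let BpfValue::U32(v) = accessor.read_field(&map, "nr_enqueues")? {
    println!("nr_enqueues = {v}");
}
accessor.write_field(&map, "nr_enqueues", BpfValue::U32(0))?;
```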
### Usage example
Find a scheduler's `.bss` map and write a crash variable:
```rust,ignore
let offsets = BpfMapOffsets::from_vmlinux(vmlinux)?;
let accessor = BpfMapAccessor::from_guest_kernel(&kernel, &offsets)?;
// Suffix match: ".bss" finds e.g. "mitosis.bss".
let bss = accessor.find_map(".bss").expect(".bss map not found");
// crash_offset: byte offset of the crash variable in the value region.
accessor.write_value_u32(&bss, crash_offset, 1);
```
### BpfMapWrite
`BpfMapWrite` specifies a host-side write to a BPF map during VM
execution. The test runner waits for the scheduler to load (map
becomes discoverable), writes the value, then signals the guest via
SHM to start the scenario.
```rust,ignore
pub struct BpfMapWrite {
    pub map_name_suffix: &'static str, // e.g. ".bss"
    pub offset: usize,                 // byte offset in the map value
    pub value: u32,                    // value to write
}
```
Use with `#[ktstr_test]` via the `bpf_map_write` attribute:
```rust,ignore
const BPF_CRASH: BpfMapWrite = BpfMapWrite {
    map_name_suffix: ".bss",
    offset: 42,
    value: 1,
};
#[ktstr_test(bpf_map_write = BPF_CRASH, expect_err = true)]
fn crash_test(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}
```
The map is discovered by name suffix via `BpfMapAccessor::find_map`.
Only `BPF_MAP_TYPE_ARRAY` maps are supported. The write targets a
u32 at the specified byte offset within the map's value region.
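A sketch of the runner-side sequencing (the polling interval and the
SHM helper name are illustrative):
```rust,ignore
let map = loop {
    // Discovery succeeds once the scheduler has loaded its maps.
    if let Some(m) = accessor.find_map(write.map_name_suffix) {
        break m;
    }
    std::thread::sleep(std::time::Duration::from_millis(10));
};
accessor.write_value_u32(&map, write.offset, write.value);
shm.signal_scenario_start(); // guest blocks on this before the scenario
```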
### Prerequisites
- **vmlinux**: Required for ELF symbols and BTF. Must match the guest
kernel. Symbols include `phys_base` so the runtime KASLR offset can
be resolved via a page-table walk through the BSP's CR3, breaking
the chicken-and-egg dependency between text-symbol PA translation
and KASLR.
### Cast analysis
BPF maps frequently store kernel pointers (`task_struct *`,
`cgroup *`, …) and arena pointers in `u64` fields because BTF cannot
express a pointer to a per-allocation type. Without intervention the
renderer treats them as integers and the failure dump shows raw
0xffff…ffff values with no further chase.
The cast analyzer (`monitor::cast_analysis::analyze_casts`) closes
that gap. The freeze coordinator runs it once per scheduler load,
before any periodic capture or on-demand snapshot would consume
its output:
1. The host loads the scheduler binary and locates each `.bpf.o`
ELF in the build artifacts.
2. Each program section is decoded through
`cast_analysis::BpfInsn::from_le_bytes` into a flat `&[BpfInsn]`
slab; relocations against `.bss` / `.data` / `.rodata` annotate
the corresponding `BPF_LD_IMM64` PCs with their datasec target.
3. `analyze_casts` walks the slab forward, tracking register and
stack-slot state for each instruction. Two detection paths feed
the output: the arena pointer path (LDX through a previously
loaded `u64` field) and the kernel kptr path (STX of a typed
pointer register into a `u64` field). Function-entry seeding
from `bpf_func_info` reseeds R1..R5 from the BTF FuncProto so
typed parameters propagate correctly across subprogram joins.
4. The result is a `CastMap` (`BTreeMap<(source_struct_btf_id,
field_byte_offset), CastHit>`) cached on the per-VM
`KtstrVm.cast_map` (a `LazyCastMap` that runs the analyzer on
first dump and caches the result process-wide by scheduler
binary content hash). The freeze coordinator threads the cached
`CastMap` through `DumpContext::cast_map` into every per-map
render so the renderer can consult it at every dump site.
5. `render_cast_pointer` in `monitor::btf_render` consumes
`CastHit` via `MemReader::cast_lookup`. When a `u64` field at a
recorded `(struct, offset)` is rendered, the renderer chases
the pointer through the address-space-appropriate reader (arena
vs slab/vmalloc) and tags the result with a `cast_annotation`
of `"cast→arena"` or `"cast→kernel"` (plus a `(sdt_alloc)`
suffix when the bridge described below fired). Failure dumps
show the annotation alongside the resolved struct fields, so
cast-recovered pointers are visually distinct from BTF-typed
ones.
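The cache and lookup shapes, with illustrative id types:
```rust,ignore
use std::collections::BTreeMap;
// (source_struct_btf_id, field_byte_offset) -> recorded cast target.
type CastMap = BTreeMap<(u32, u32), CastHit>;

// Render-time consultation: a u64 field is only chased when the
// analyzer recorded a hit for this exact (struct, offset) pair.
fn cast_lookup(map: &CastMap, struct_id: u32, field_offset: u32) -> Option<&CastHit> {
    map.get(&(struct_id, field_offset))
}
```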
The renderer also consults an `sdt_alloc` bridge whenever a chase
target peels to a `BTF_KIND_FWD` forward declaration (typical for
`struct sdt_data __arena *` fields whose body lives in the
sdt_alloc library's BTF rather than the scheduler's program BTF).
The dump-state pre-pass walks each live `scx_allocator` and
populates a `slot_start → ArenaSlotInfo` index — one entry
per live allocator slot, carrying `elem_size`, `header_size`, and
the resolved payload BTF type id — that
`MemReader::resolve_arena_type` (in
`dump::render_map::AccessorMemReader`) range-looks up during the
chase. The lookup finds the slot whose
`[slot_start, slot_start + elem_size)` range contains the chased
address and routes by `offset_in_slot`: a slot-start chase
(`offset == 0`, e.g. the `data` field of `scx_task_map_val`
storing the raw `sdt_alloc()` return) returns the payload type id
with `header_skip = header_size`; a payload-start chase
(`offset == header_size`, e.g. the return of `scx_task_data(p)`
cached in `cached_taskc_raw`) returns the same payload type id
with `header_skip = 0`. The renderer reads `header_skip + btf_size`
bytes from the chased address, slices off the leading
`header_skip` bytes, and renders the payload struct. The
resulting `Ptr` carries a `sdt_alloc`-flavoured annotation:
`"sdt_alloc"` on the BTF-typed `Type::Ptr` arm, and
`"cast→arena (sdt_alloc)"` / `"cast→kernel (sdt_alloc)"` on the
cast-analyzer-driven path. The `sdt_alloc` bridge fires only when
the BTF-only resolve has already exhausted same-name siblings;
false-positive risk on that arm is bounded by the arena-window
range check (`MemReader::resolve_arena_type` returns `None` for
addresses outside every known allocator slot).
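A sketch of the range lookup and routing (struct layout and types
illustrative; interior offsets conservatively miss):
```rust,ignore
use std::collections::BTreeMap;
struct ArenaSlotInfo { elem_size: u64, header_size: u64, payload_type_id: u32 }

fn resolve_arena_type(
    slots: &BTreeMap<u64, ArenaSlotInfo>, // slot_start -> info
    addr: u64,
) -> Option<(u32, u64)> { // (payload type id, header_skip)
    // Greatest slot_start <= addr, then range-check the slot window.
    let (&start, info) = slots.range(..=addr).next_back()?;
    if addr >= start + info.elem_size {
        return None; // outside every known allocator slot
    }
    match addr - start {
        0 => Some((info.payload_type_id, info.header_size)),           // slot-start chase
        o if o == info.header_size => Some((info.payload_type_id, 0)), // payload-start chase
        _ => None,
    }
}
```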
A separate cross-BTF Fwd resolution path covers the case where a
`BTF_KIND_FWD` pointee's body lives in a sibling embedded BPF
object's BTF rather than an `sdt_alloc` slot — the typical
multi-`.bpf.objs` shape where one object declares
`struct cgx_target;` (forward) and a sibling object defines
`struct cgx_target { ... }` (full body). The cast-analysis
pre-pass (`vmm::cast_analysis_load::build_fwd_index`) walks every
parsed embedded program BTF and records a
`name -> (btfs index, type_id)` entry for every complete
(`!is_fwd`) `Type::Struct` / `Type::Union`. First-write-wins on
duplicate names: when the same name appears in multiple BTFs the
index keeps the first-seen entry. Anonymous types and `Typedef`
are not indexed (no name to key on, and typedefs add no body —
the chase peels through them via `peel_modifiers_with_id` before
consulting the index). The index is threaded through
`DumpContext::cross_btf` and exposed to the renderer via
`MemReader::cross_btf_resolve_fwd`. When `chase_arena_pointer` /
`render_cast_pointer` peel a chase target through
`peel_modifiers_resolving_fwd` and the local same-BTF sibling
search came up empty, `try_cross_btf_fwd_resolve` consults the
cross-BTF index by the Fwd's name (and aggregate kind — `struct`
vs `union`); a hit returns a `CrossBtfRef { btf, type_id }` and
the chase recursion switches to the resolved sibling BTF for the
pointee render. Cross-BTF resolution does NOT introduce a new
annotation — the body is recovered transparently and the rendered
subtree carries the cast or BTF-typed annotation it would have
had if the same struct lived in the entry BTF. Unlike the
`sdt_alloc` bridge the cross-BTF index is consulted whenever a
Fwd terminal survives the local resolve — there is no
arena-window gate, since the lookup is purely a name-keyed BTF
table and a name miss simply leaves the chase on its existing
"forward declaration; body not in this BTF" skip path.
The analyzer is deliberately conservative: branch joins reset
register and stack state, conflicts drop the offending entry, and
self-stores are rejected. False negatives fall back to raw `u64`
(the prior behavior); false positives would chase garbage and are
avoided. The analysis is unconditional — no test-author
configuration, no opt-in flag — and the freeze coordinator wires
the resulting `CastMap` through every snapshot, periodic capture,
and failure dump.
## Probe pipeline
The probe pipeline captures function arguments and struct fields during
auto-repro. It operates inside the guest VM (not from the host), using
two BPF skeletons that share maps.
### Architecture
```text
crash stack -> extract functions -> BTF resolve -> load skeletons -> poll
                         |
  kprobe skeleton        |        fentry/fexit skeleton
  (kernel entry)         |        (BPF entry + kernel exit)
        |                |                 |
        v                v                 v
  func_meta_map    <--shared-->    probe_data
        |                          (entry + exit fields)
  trigger fires (ring buffer)
        |
  read probe_data entries
        |
  stitch by tptr
        |
  format with entry→exit diffs
```
### Kprobe skeleton (`probe.bpf.c`)
Attaches to kernel functions via `attach_kprobe`. The BPF handler:
1. Gets the function IP via `bpf_get_func_ip`
2. Looks up `func_meta` from `func_meta_map` (keyed by IP)
3. Captures 6 raw args from `pt_regs`
4. Dereferences struct fields via BTF-resolved offsets
5. Reads char * string params if configured
6. Stores result in `probe_data` (keyed by `(func_ip, task_ptr)`)
The trigger fires via `tp_btf/sched_ext_exit` (inside
`scx_claim_exit()`) and sends an `EVENT_TRIGGER` via ring buffer
with the current task pointer and kernel stack.
### Fentry/fexit skeleton (`fentry_probe.bpf.c`)
Handles both BPF struct_ops callbacks and kernel function exit
capture. Loaded in batches of 4 fentry + 4 fexit programs per
skeleton instance via `set_attach_target`. Shares `probe_data` and
`func_meta_map` with the kprobe skeleton via `reuse_fd`.
A per-slot `is_kernel` rodata flag controls argument access:
- **BPF callbacks** (`is_kernel=0`): `ctx[0]` is a void pointer to
the real callback arguments. The handler dereferences through it.
Uses sentinel IPs (`func_idx | (1ULL << 63)`) in `func_meta_map`.
- **Kernel functions** (`is_kernel=1`): args are directly in
`ctx[0..5]`. Uses `bpf_get_func_ip(ctx)` for the real IP,
matching the kprobe entry handler's key.
Fexit handlers look up the existing `probe_data` entry (written by
fentry or kprobe at function entry) and re-read struct fields into
`exit_fields`. This captures post-mutation state for paired display.
### BTF resolution
Two BTF sources:
- **vmlinux BTF** (`btf-rs`): resolves kernel struct offsets. Types in
`STRUCT_FIELDS` (task_struct, rq, scx_dispatch_q, etc.) use curated
field lists with chained pointer dereferences (e.g.
`->cpus_ptr->bits[0]`). Other struct pointer params get scalar, enum,
and cpumask pointer fields auto-discovered from vmlinux BTF.
- **Program BTF** (`libbpf-rs`): resolves BPF-local struct offsets for
types not in vmlinux (e.g. scheduler-defined `task_ctx`).
Auto-discovers scalar, enum, and cpumask pointer fields.
Callback signatures are resolved by:
1. `____name` inner function in program BTF (typed params)
2. `sched_ext_ops` member in vmlinux BTF (fallback)
3. Wrapper function (void *ctx, no useful params)
### Field decoding
The output formatter decodes field values based on their key name:
- `dsq_id` -> `SCX_DSQ_INVALID`, `SCX_DSQ_GLOBAL`, `SCX_DSQ_LOCAL`,
`SCX_DSQ_BYPASS`, `SCX_DSQ_LOCAL_ON|{cpu}`, `BUILTIN({v})`,
`DSQ(0x{hex})`
- `cpumask_0..3` -> coalesced into one `cpus_ptr` field rendered as
`0x{hex}({cpu-list})` — the mask words in hex (`cpumask_0` is the
low-order word; multi-word masks print the most-significant 64-bit
chunk first, joined with `_`) followed by the run-length-collapsed
CPU range list (e.g. `0xf(0-3)`, `0x1_00000000000000ff(0-7,64)`);
see the sketch after this list
- `enq_flags` -> `WAKEUP|HEAD|PREEMPT`
- `exit_kind` -> `ERROR`, `ERROR_BPF`, `ERROR_STALL`, etc.
- `scx_flags` -> `QUEUED|ENABLED`
- `sticky_cpu` -> `-1` for 0xffffffff
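A sketch of the cpumask coalescing, reproducing the examples above
(trimming of all-zero high words is omitted):
```rust,ignore
fn render_cpumask(words: &[u64]) -> String {
    // Hex: most-significant word first; lower words zero-padded to
    // 16 digits so chunk boundaries stay visible.
    let mut hex = String::new();
    for (i, w) in words.iter().rev().enumerate() {
        if i == 0 {
            hex = format!("{w:x}");
        } else {
            hex = format!("{hex}_{w:016x}");
        }
    }
    // CPU list: collapse consecutive set bits into ranges.
    let cpus: Vec<usize> = (0..words.len() * 64)
        .filter(|&b| (words[b / 64] >> (b % 64)) & 1 == 1)
        .collect();
    let mut parts = Vec::new();
    let mut i = 0;
    while i < cpus.len() {
        let (start, mut end) = (cpus[i], cpus[i]);
        while i + 1 < cpus.len() && cpus[i + 1] == end + 1 {
            i += 1;
            end = cpus[i];
        }
        parts.push(if start == end {
            format!("{start}")
        } else {
            format!("{start}-{end}")
        });
        i += 1;
    }
    format!("0x{hex}({})", parts.join(",")) // e.g. "0xf(0-3)"
}
```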
### Event stitching
After the trigger fires, all `probe_data` entries are read, matched
to functions by IP, then filtered to a single task's scheduling
journey:
1. Read the task_struct pointer from the trigger event's
`bpf_get_current_task()` value (`args[0]`)
2. For functions with a task_struct parameter: keep events where
`args[param_idx] == tptr`
3. For functions without a task_struct parameter: keep events where
`task_ptr == tptr` (matched via `bpf_get_current_task()` at
probe time)
Events are sorted by timestamp for chronological output.
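A sketch of the host-side filter (event fields illustrative):
```rust,ignore
let tptr = trigger.args[0]; // bpf_get_current_task() at trigger time
let mut journey: Vec<&ProbeEvent> = events.iter()
    .filter(|e| match e.func.task_param_idx {
        Some(idx) => e.args[idx] == tptr, // task_struct * parameter
        None => e.task_ptr == tptr,       // current task at probe time
    })
    .collect();
journey.sort_by_key(|e| e.timestamp); // chronological output
```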