# Monitor
The monitor observes scheduler state from the host side by reading
guest VM memory directly. It does not instrument the guest kernel or
the scheduler under test.
## What it reads
The monitor resolves kernel structure offsets via BTF (BPF Type Format)
from the guest kernel. It reads per-CPU runqueue structures to extract:
- `nr_running` -- number of runnable tasks on each CPU
- `scx_nr_running` -- tasks managed by the sched_ext scheduler
- `rq_clock` -- runqueue clock value
- `local_dsq_depth` -- scx local dispatch queue depth
- `scx_flags` -- sched_ext flags for each CPU
- scx event counters (fallback, keep-last, offline dispatch,
skip-exiting, skip-migration-disabled)
When `CONFIG_SCHEDSTATS` is enabled, the monitor also reads per-CPU
`struct rq` schedstat fields (run_delay, pcount, sched_count,
ttwu_count, etc.).
The monitor walks the `struct sched_domain` tree whenever BTF
contains `rq->sd` and `struct sched_domain` — no `CONFIG_SCHEDSTATS`
required. Domain tree walking starts at `rq->sd` (lowest level) and
follows `sd->parent` pointers up to the root. Each domain level
provides topology metadata (level, name, flags, span_weight) and
runtime fields (balance_interval, nr_balance_failed, newidle_call,
newidle_success, newidle_ratio, max_newidle_lb_cost). When
`CONFIG_SCHEDSTATS` is also enabled, each
domain additionally provides load balancing stats: `lb_count`,
`lb_failed`, `lb_balanced`, `alb_pushed`, `ttwu_wake_remote`, and
other counters indexed by idle type (`CPU_NOT_IDLE`, `CPU_IDLE`,
`CPU_NEWLY_IDLE`).
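The walk described above reduces to a parent-chain traversal. The `DomainNode` type below is an illustrative stand-in; the real monitor follows `sd->parent` pointers read out of guest memory at BTF-resolved offsets:

```rust
// Hypothetical in-memory model of the domain chain; the real code reads
// each node's fields from guest memory rather than a Vec.
struct DomainNode {
    level: u32,
    name: &'static str,
    parent: Option<usize>, // index of sd->parent, None at the root
}

// Start at the lowest level (rq->sd) and follow parent links to the root,
// collecting (level, name) for each domain visited.
fn walk_domains(nodes: &[DomainNode], lowest: usize) -> Vec<(u32, &'static str)> {
    let mut out = Vec::new();
    let mut cur = Some(lowest);
    while let Some(i) = cur {
        out.push((nodes[i].level, nodes[i].name));
        cur = nodes[i].parent;
    }
    out
}
```

On a typical x86 machine this yields a chain like SMT, then MC, then PKG.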
## Sampling
The monitor takes periodic snapshots (`MonitorSample`) of all per-CPU
state. Each sample captures a point-in-time view of every CPU.
`MonitorSummary` aggregates samples into peak values (max imbalance
ratio, max DSQ depth, stall detection), per-sample averages
(imbalance ratio, nr_running per CPU, DSQ depth per CPU), and event
counter deltas. Averages are computed over valid samples only
(excluding uninitialized guest memory).
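The valid-samples-only averaging can be sketched as follows. The function name and sample representation are hypothetical simplifications of what `MonitorSummary::from_samples` does:

```rust
// Sketch: average DSQ depth across samples, skipping any sample in which
// some CPU reports an implausible depth (uninitialized guest memory).
fn avg_dsq_depth(samples: &[Vec<u32>], ceiling: u32) -> Option<f64> {
    let valid: Vec<&Vec<u32>> = samples
        .iter()
        .filter(|s| s.iter().all(|&d| d <= ceiling)) // drop implausible samples
        .collect();
    if valid.is_empty() {
        return None; // nothing trustworthy to average
    }
    let total: u64 = valid.iter().flat_map(|s| s.iter()).map(|&d| d as u64).sum();
    let count: usize = valid.iter().map(|s| s.len()).sum();
    Some(total as f64 / count as f64)
}
```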
## Threshold evaluation
`MonitorThresholds` defines pass/fail conditions:
```rust,ignore
pub struct MonitorThresholds {
pub max_imbalance_ratio: f64,
pub max_local_dsq_depth: u32,
pub fail_on_stall: bool,
pub sustained_samples: usize,
pub max_fallback_rate: f64,
pub max_keep_last_rate: f64,
}
```
A violation must persist for `sustained_samples` consecutive samples
before triggering a failure. This filters transient spikes from cpuset
transitions and cgroup creation/destruction.
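A minimal sketch of the sustained-violation rule (function name hypothetical):

```rust
// A value must exceed the limit on `sustained` consecutive samples before
// it counts as a failure; any in-range sample resets the window.
fn sustained_violation(values: &[f64], limit: f64, sustained: usize) -> bool {
    let mut consecutive = 0;
    for &v in values {
        if v > limit {
            consecutive += 1;
            if consecutive >= sustained {
                return true; // violation persisted long enough
            }
        } else {
            consecutive = 0; // transient spike: reset
        }
    }
    false
}
```

A single spike from a cpuset transition never reaches `sustained` and is filtered out.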
### Stall detection
A stall is detected when a CPU's `rq_clock` does not advance between
consecutive samples. Three exemptions prevent false positives:
- **Idle CPUs**: when `nr_running == 0` in both the current and previous
sample, the CPU has no runnable tasks. The kernel stops the tick
(NOHZ) on idle CPUs, so `rq_clock` legitimately does not advance.
These CPUs are excluded from stall checks.
- **Preempted vCPUs**: when the vCPU thread's CPU time did not advance
past the preemption threshold between samples, the host preempted the
vCPU. These samples are excluded from stall checks.
- **Sustained window**: stall detection uses per-CPU consecutive
counters and the `sustained_samples` threshold, matching how
imbalance and DSQ depth checks work. A single stuck sample does
not trigger failure -- the stall must persist for `sustained_samples`
consecutive samples on the same CPU.
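The idle exemption can be sketched as a per-CPU predicate (types simplified); the sustained-window and vCPU-preemption checks then apply on top of this:

```rust
// Simplified per-CPU sample: just the two fields the idle exemption needs.
struct CpuSample {
    rq_clock: u64,
    nr_running: u32,
}

// A CPU counts as stalled only when rq_clock is stuck AND it had runnable
// tasks in both samples; NOHZ-idle CPUs legitimately stop their clock.
fn cpu_stalled(prev: &CpuSample, cur: &CpuSample) -> bool {
    let idle_both = prev.nr_running == 0 && cur.nr_running == 0;
    let clock_stuck = cur.rq_clock == prev.rq_clock;
    clock_stuck && !idle_both
}
```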
## Uninitialized memory detection
Before the guest kernel initializes per-CPU structures, monitor reads
return uninitialized data. Two layers handle this:
- **Summary computation** (`MonitorSummary::from_samples`): skips
individual samples where any CPU's `local_dsq_depth` exceeds
`DSQ_PLAUSIBILITY_CEILING` (10,000) via `sample_looks_valid()`.
- **Threshold evaluation** (`MonitorThresholds::evaluate`): checks all
samples globally via `data_looks_valid()`. If all `rq_clock` values
are identical across every CPU and sample, or any sample exceeds the
plausibility ceiling, the entire report is treated as "not yet
initialized" and passes without running any per-threshold checks.
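The global `rq_clock` check can be sketched like this, as a hypothetical standalone version of `data_looks_valid()` operating on one `Vec<u64>` of clocks per sample:

```rust
// The data looks valid only if at least one rq_clock value differs from
// the rest; identical clocks everywhere mean the guest never initialized
// its runqueues (or no samples were taken at all).
fn data_looks_valid(rq_clocks: &[Vec<u64>]) -> bool {
    let mut iter = rq_clocks.iter().flat_map(|s| s.iter());
    match iter.next() {
        None => false, // no samples: nothing to evaluate
        Some(first) => iter.any(|c| c != first),
    }
}
```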
## BPF map introspection
The monitor module also provides host-side BPF map discovery and
read/write access via `bpf_map::BpfMapAccessor`. The host reads and
writes guest BPF maps directly through the physical memory mapping
— no guest cooperation or BPF syscalls are needed.
### GuestMem
`GuestMem` wraps a host pointer to the start of guest DRAM and
provides bounds-checked volatile reads and writes for scalar types
(u8/u32/u64). Byte-slice reads (`read_bytes`) use
`copy_nonoverlapping`. It also implements x86-64 page table walks
(`translate_kva`) for both 4-level and 5-level paging, and 3-level
aarch64 walks (64KB granule).
Scalar accesses use volatile semantics because the guest kernel
modifies memory concurrently.
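A minimal sketch of such a bounds-checked volatile read, assuming a byte slice stands in for the host mapping of guest DRAM (the real `GuestMem` wraps a raw host pointer):

```rust
use std::ptr;

// Simplified stand-in for GuestMem: `dram` is the host view of guest memory.
struct Mem<'a> {
    dram: &'a [u8],
}

impl<'a> Mem<'a> {
    // Bounds-checked volatile u32 read at guest physical address `pa`.
    fn read_u32(&self, pa: usize) -> Option<u32> {
        let end = pa.checked_add(4)?;
        if end > self.dram.len() {
            return None; // refuse reads outside guest DRAM
        }
        // Volatile, byte-wise: the guest kernel may be writing this memory
        // concurrently, and byte reads avoid alignment assumptions.
        let mut b = [0u8; 4];
        for (i, slot) in b.iter_mut().enumerate() {
            *slot = unsafe { ptr::read_volatile(self.dram.as_ptr().add(pa + i)) };
        }
        Some(u32::from_le_bytes(b))
    }
}
```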
### GuestKernel
`GuestKernel` builds on `GuestMem` by adding kernel symbol
resolution and address translation. It parses the vmlinux ELF
symbol table at construction and resolves paging configuration
(PAGE_OFFSET, CR3, 4-level vs 5-level) from guest memory.
Subsequent reads use cached state.
Three address translation modes are supported:
- **Text/data/bss**: `kva - __START_KERNEL_map`. For statically-linked
kernel variables (`read_symbol_*`, `write_symbol_*`).
- **Direct mapping**: `kva - PAGE_OFFSET`. For SLAB allocations,
per-CPU data, physically contiguous memory (`read_direct_*`).
- **Vmalloc/vmap**: Page table walk via CR3. For BPF maps, vmalloc'd
memory, module text (`read_kva_*`, `write_kva_*`).
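The first two modes reduce to subtraction. The constants below are the stock x86-64 values for a 4-level, nokaslr kernel; the real code resolves them from the guest at runtime, and vmalloc addresses instead go through the page-table walker:

```rust
// Stock x86-64 constants, assuming nokaslr and 4-level paging; the real
// code reads these from the parsed vmlinux and guest paging config.
const START_KERNEL_MAP: u64 = 0xffff_ffff_8000_0000;
const PAGE_OFFSET: u64 = 0xffff_8880_0000_0000;

/// Text/data/bss: kva - __START_KERNEL_map (assumes phys_base = 0).
fn text_to_pa(kva: u64) -> u64 {
    kva - START_KERNEL_MAP
}

/// Direct mapping: kva - PAGE_OFFSET.
fn direct_to_pa(kva: u64) -> u64 {
    kva - PAGE_OFFSET
}
```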
### BpfMapAccessor
`BpfMapAccessor` resolves BTF offsets for BPF map kernel structures
(`struct bpf_map`, `struct bpf_array`, `struct xa_node`, `struct idr`)
and provides map discovery and value read/write. It borrows a
`GuestKernel` for address translation.
`BpfMapAccessorOwned` is a convenience wrapper that owns the
`GuestKernel` internally. Use `BpfMapAccessor::from_guest_kernel`
when you already have a `GuestKernel`; use `BpfMapAccessorOwned::new`
when you want a self-contained accessor.
Map discovery walks the kernel's `map_idr` xarray:
1. Read `map_idr` (BSS symbol, text mapping translation)
2. Walk xa_node tree (SLAB-allocated, direct mapping translation)
3. Read `struct bpf_map` fields (vmalloc'd, page table walk)
`find_map` searches by **name suffix** (e.g. `".bss"` matches
`"mitosis.bss"`). Only `BPF_MAP_TYPE_ARRAY` maps are returned.
Use `maps()` to enumerate all map types without filtering.
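Suffix matching with the array-type filter can be sketched as follows (types simplified; the real accessor reads names and types out of guest `struct bpf_map` memory):

```rust
// Value of BPF_MAP_TYPE_ARRAY in the kernel UAPI enum bpf_map_type.
const BPF_MAP_TYPE_ARRAY: u32 = 2;

// Simplified stand-in for a discovered map.
struct MapInfo {
    name: String,
    map_type: u32,
}

// Return the first array map whose name ends with `suffix`.
fn find_map<'a>(maps: &'a [MapInfo], suffix: &str) -> Option<&'a MapInfo> {
    maps.iter()
        .find(|m| m.map_type == BPF_MAP_TYPE_ARRAY && m.name.ends_with(suffix))
}
```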
Value access for `BPF_MAP_TYPE_ARRAY` maps reads/writes the inline
`bpf_array.value` flex array at the BTF-resolved offset. The value
region is vmalloc'd, so each byte access goes through the page table
walker to handle page boundaries.
For `BPF_MAP_TYPE_PERCPU_ARRAY` maps, `bpf_array.pptrs[key]` holds
a `__percpu` pointer (at the same union offset as `value`). Adding
`__per_cpu_offset[cpu]` yields the per-CPU KVA in the direct mapping.
`read_percpu_array` returns one `Option<Vec<u8>>` per CPU: `Some`
when the per-CPU PA falls within guest memory, `None` when it does not.
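The per-CPU address computation is plain pointer arithmetic once `__per_cpu_offset` has been read from the guest (sketch; names hypothetical):

```rust
// Add the CPU's entry in __per_cpu_offset to the __percpu pointer to get
// that CPU's KVA in the direct mapping; None for an out-of-range CPU.
fn percpu_kva(pptr: u64, per_cpu_offset: &[u64], cpu: usize) -> Option<u64> {
    per_cpu_offset.get(cpu).map(|off| pptr.wrapping_add(*off))
}
```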
### Typed field access
When a map has BTF metadata (`btf_kva != 0`), `resolve_value_layout`
reads the guest's `struct btf` and its `data` blob, parses it with
`btf_rs`, and resolves the value struct's fields. This enables
`read_field` / `write_field` with type-checked `BpfValue` variants.
### Usage example
Find a scheduler's `.bss` map and write a crash variable:
```rust,ignore
let accessor = BpfMapAccessor::from_guest_kernel(&kernel, vmlinux)?;
let bss = accessor.find_map(".bss").expect(".bss map not found");
accessor.write_value_u32(&bss, crash_offset, 1);
```
### BpfMapWrite
`BpfMapWrite` specifies a host-side write to a BPF map during VM
execution. The test runner waits for the scheduler to load (map
becomes discoverable), writes the value, then signals the guest via
SHM to start the scenario.
```rust,ignore
pub struct BpfMapWrite {
pub map_name_suffix: &'static str, // e.g. ".bss"
pub offset: usize, // byte offset in the map value
pub value: u32, // value to write
}
```
Use with `#[ktstr_test]` via the `bpf_map_write` attribute:
```rust,ignore
const BPF_CRASH: BpfMapWrite = BpfMapWrite {
map_name_suffix: ".bss",
offset: 42,
value: 1,
};
#[ktstr_test(bpf_map_write = BPF_CRASH, expect_err = true)]
fn crash_test(ctx: &Ctx) -> Result<AssertResult> {
Ok(AssertResult::pass())
}
```
The map is discovered by name suffix via `BpfMapAccessor::find_map`.
Only `BPF_MAP_TYPE_ARRAY` maps are supported. The write targets a
u32 at the specified byte offset within the map's value region.
### Prerequisites
- **vmlinux**: Required for ELF symbols and BTF. Must match the guest
kernel.
- **nokaslr**: Required on the guest command line. Text mapping
translation assumes `phys_base = 0`.
## Probe pipeline
The probe pipeline captures function arguments and struct fields during
auto-repro. It operates inside the guest VM (not from the host), using
two BPF skeletons that share maps.
### Architecture
```text
crash stack -> extract functions -> BTF resolve -> load skeletons -> poll
|
kprobe skeleton | fentry/fexit skeleton
(kernel entry) | (BPF + kernel exit)
| | |
v v v
func_meta_map <--shared--> probe_data
| (entry + exit fields)
trigger fires (ring buffer)
|
read probe_data entries
|
stitch by tptr
|
format with entry→exit diffs
```
### Kprobe skeleton (`probe.bpf.c`)
Attaches to kernel functions via `attach_kprobe`. The BPF handler:
1. Gets the function IP via `bpf_get_func_ip`
2. Looks up `func_meta` from `func_meta_map` (keyed by IP)
3. Captures 6 raw args from `pt_regs`
4. Dereferences struct fields via BTF-resolved offsets
5. Reads char * string params if configured
6. Stores result in `probe_data` (keyed by `(func_ip, task_ptr)`)
The trigger fires via `tp_btf/sched_ext_exit` (inside
`scx_claim_exit()`) and sends an `EVENT_TRIGGER` via ring buffer
with the current task pointer and kernel stack.
### Fentry/fexit skeleton (`fentry_probe.bpf.c`)
Handles both BPF struct_ops callbacks and kernel function exit
capture. Loaded in batches of 4 fentry + 4 fexit programs per
skeleton instance via `set_attach_target`. Shares `probe_data` and
`func_meta_map` with the kprobe skeleton via `reuse_fd`.
A per-slot `is_kernel` rodata flag controls argument access:
- **BPF callbacks** (`is_kernel=0`): `ctx[0]` is a void pointer to
the real callback arguments. The handler dereferences through it.
Uses sentinel IPs (`func_idx | (1<<63)`) in `func_meta_map`.
- **Kernel functions** (`is_kernel=1`): args are directly in
`ctx[0..5]`. Uses `bpf_get_func_ip(ctx)` for the real IP,
matching the kprobe entry handler's key.
Fexit handlers look up the existing `probe_data` entry (written by
fentry or kprobe at function entry) and re-read struct fields into
`exit_fields`. This captures post-mutation state for paired display.
### BTF resolution
Two BTF sources:
- **vmlinux BTF** (`btf-rs`): resolves kernel struct offsets. Types in
`STRUCT_FIELDS` (task_struct, rq, scx_dispatch_q, etc.) use curated
field lists with chained pointer dereferences (e.g.
`->cpus_ptr->bits[0]`). Other struct pointer params get scalar, enum,
and cpumask pointer fields auto-discovered from vmlinux BTF.
- **Program BTF** (`libbpf-rs`): resolves BPF-local struct offsets for
types not in vmlinux (e.g. scheduler-defined `task_ctx`).
Auto-discovers scalar, enum, and cpumask pointer fields.
Callback signatures are resolved by:
1. `____name` inner function in program BTF (typed params)
2. `sched_ext_ops` member in vmlinux BTF (fallback)
3. Wrapper function (void *ctx, no useful params)
### Field decoding
The output formatter decodes field values based on their key name:
- `dsq_id` -> `SCX_DSQ_INVALID`, `SCX_DSQ_GLOBAL`, `SCX_DSQ_LOCAL`, `SCX_DSQ_BYPASS`, `SCX_DSQ_LOCAL_ON|{cpu}`, `BUILTIN({v})`, `DSQ(0x{hex})`
- `cpumask_0..3` -> coalesced `cpus_ptr 0xf(0-3)`
- `enq_flags` -> `WAKEUP|HEAD|PREEMPT`
- `exit_kind` -> `ERROR`, `ERROR_BPF`, `ERROR_STALL`, etc.
- `scx_flags` -> `QUEUED|ENABLED`
- `sticky_cpu` -> `-1` for 0xffffffff
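The flag-style decodings above can be sketched as a generic bitflag formatter (the bit values in the test below are illustrative, not the kernel's actual constants):

```rust
// Render the names of all set bits joined by '|', falling back to hex
// when no known bit is set.
fn decode_flags(v: u64, names: &[(u64, &str)]) -> String {
    let set: Vec<&str> = names
        .iter()
        .filter(|&&(bit, _)| (v & bit) != 0)
        .map(|&(_, n)| n)
        .collect();
    if set.is_empty() {
        format!("0x{:x}", v)
    } else {
        set.join("|")
    }
}
```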
### Event stitching
After the trigger fires, all `probe_data` entries are read, matched
to functions by IP, then filtered to a single task's scheduling
journey:
1. Read the task_struct pointer from the trigger event's
`bpf_get_current_task()` value (`args[0]`)
2. For functions with a task_struct parameter: keep events where
`args[param_idx] == tptr`
3. For functions without a task_struct parameter: keep events where
`task_ptr == tptr` (matched via `bpf_get_current_task()` at
probe time)
Events are sorted by timestamp for chronological output.
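The filter-and-sort step can be sketched as follows (simplified event type; the real entries also carry the captured args and struct fields):

```rust
// Minimal stand-in for a probe_data entry.
struct ProbeEvent {
    ts: u64,
    task_ptr: u64, // bpf_get_current_task() recorded at probe time
    name: &'static str,
}

// Keep only the triggering task's events, then order them chronologically.
fn stitch(mut events: Vec<ProbeEvent>, tptr: u64) -> Vec<ProbeEvent> {
    events.retain(|e| e.task_ptr == tptr); // one task's scheduling journey
    events.sort_by_key(|e| e.ts);          // chronological output
    events
}
```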