# Checking
ktstr checks scheduler behavior through two channels: worker-side
telemetry and host-side monitoring.
## Worker checks
After each scenario, ktstr collects
[`WorkerReport`](../architecture/workers.md#telemetry) from every worker
process. Several checks run against these reports:
**Starvation** -- any worker with `work_units == 0` fails the test.
**Fairness** -- workers in the same cgroup should get similar CPU time.
The "spread" (max off-CPU% - min off-CPU%) must be below a threshold
(15% in release builds, 35% in debug). Violations report the spread
and per-cgroup statistics.
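The spread computation can be sketched in a few lines. This is an illustrative re-implementation, not ktstr's actual code; the function name and input shape (per-worker off-CPU percentages) are assumptions.

```rust
// Hypothetical sketch of the fairness-spread check: spread is the
// difference between the largest and smallest off-CPU% in a cgroup.
fn off_cpu_spread(off_cpu_pcts: &[f64]) -> f64 {
    let max = off_cpu_pcts.iter().cloned().fold(f64::MIN, f64::max);
    let min = off_cpu_pcts.iter().cloned().fold(f64::MAX, f64::min);
    max - min
}

fn main() {
    // Three workers in one cgroup: 10%, 18%, 40% off-CPU.
    let spread = off_cpu_spread(&[10.0, 18.0, 40.0]);
    assert!((spread - 30.0).abs() < 1e-9);
    // 30% exceeds the 15% release threshold but passes the 35% debug one.
    assert!(spread > 15.0 && spread < 35.0);
}
```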
**Scheduling gaps** -- the longest wall-clock gap observed at
work-unit checkpoints. Gaps above a threshold (2000ms release, 3000ms
debug) indicate the scheduler dropped a task. Reports include the gap
duration, CPU, and timing.
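The gap computation reduces to the longest delta between consecutive checkpoint timestamps. A minimal sketch, assuming checkpoints arrive as monotonic millisecond timestamps (the real report shape may differ):

```rust
// Illustrative sketch: find the longest wall-clock gap between
// consecutive work-unit checkpoints (timestamps in milliseconds).
fn max_gap_ms(checkpoints_ms: &[u64]) -> u64 {
    checkpoints_ms
        .windows(2)
        .map(|w| w[1].saturating_sub(w[0]))
        .max()
        .unwrap_or(0)
}

fn main() {
    let ts = [0, 120, 250, 2600, 2710]; // one 2350 ms gap
    assert_eq!(max_gap_ms(&ts), 2350);
    // 2350 ms exceeds the 2000 ms release threshold.
    assert!(max_gap_ms(&ts) > 2000);
}
```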
**Cpuset isolation** -- workers must only run on CPUs in their assigned
cpuset. Any execution on an unexpected CPU fails the test. Opt-in via
`isolation = true` on the `#[ktstr_test]` attribute or via
`Assert::check_isolation()`; `Assert::default_checks()` leaves this
`None`, so the runtime merge resolves to `false` and the check is
skipped unless explicitly enabled.
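The isolation predicate is a subset check: every CPU a worker was observed on must belong to its cpuset. A sketch with assumed names, independent of ktstr's internals:

```rust
use std::collections::BTreeSet;

// Sketch of the cpuset-isolation check: a worker passes only if every
// CPU it executed on is in its assigned cpuset.
fn isolation_ok(observed_cpus: &[u32], cpuset: &BTreeSet<u32>) -> bool {
    observed_cpus.iter().all(|cpu| cpuset.contains(cpu))
}

fn main() {
    let cpuset: BTreeSet<u32> = [0, 1, 2, 3].into_iter().collect();
    assert!(isolation_ok(&[0, 2, 2, 1], &cpuset));
    // A single execution on CPU 5, outside the cpuset, fails the test.
    assert!(!isolation_ok(&[0, 2, 5], &cpuset));
}
```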
**Throughput parity** -- `assert_throughput_parity()` checks that
workers produce similar throughput (work_units per CPU-second). Two
thresholds:
- `max_throughput_cv`: coefficient of variation across workers. High
CV means the scheduler gives some workers disproportionately less
effective CPU. Requires at least 2 workers with nonzero CPU time.
- `min_work_rate`: minimum work_units per CPU-second per worker.
Catches cases where all workers are equally slow (CV passes but
absolute throughput is too low).
Neither threshold is set by default; enable via `Assert` setters or
`#[ktstr_test]` attributes.
**Benchmarking** -- `assert_benchmarks()` checks per-wakeup latency
and iteration throughput. Three thresholds:
- `max_p99_wake_latency_ns`: p99 of all `resume_latencies_ns` samples
across workers in a cgroup. Populated only for work types that
record wake-to-run latency: `IoSyncWrite`, `IoRandRead`, `IoConvoy`,
`Bursty`, `PipeIo`,
`FutexPingPong`, `CacheYield`, `CachePipe`, `FutexFanOut`
(receivers), `Sequence` (Sleep / Yield / Io phases),
`ForkExit`, `NiceSweep`, `AffinityChurn`, `PolicyChurn`,
`FanOutCompute`, `MutexContention`. Pure-CPU work types
(`SpinWait`, `Mixed`, `CachePressure`, `PageFaultChurn`) do not
record samples.
- `max_wake_latency_cv`: coefficient of variation of wake latency
samples. High CV means inconsistent scheduling latency.
- `min_iteration_rate`: minimum outer-loop iterations per wall-clock
second per worker.
None are set by default. Set via `Assert` setters or `#[ktstr_test]`
attributes.
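A p99 over pooled latency samples can be computed with a nearest-rank percentile; ktstr's actual percentile method may differ, so treat this as a sketch:

```rust
// Sketch: p99 of wake-latency samples (nanoseconds) via nearest rank.
fn p99_ns(samples: &mut Vec<u64>) -> u64 {
    samples.sort_unstable();
    // Nearest-rank: ceil(0.99 * n), converted to a zero-based index.
    let idx = ((samples.len() as f64 * 0.99).ceil() as usize).saturating_sub(1);
    samples[idx]
}

fn main() {
    // 99 fast wakes and one 5 ms outlier: p99 ignores the single outlier.
    let mut samples: Vec<u64> = vec![50_000; 99];
    samples.push(5_000_000);
    assert_eq!(p99_ns(&mut samples), 50_000);
    // With two outliers among 100 samples, p99 lands on the first of them.
    let mut samples2: Vec<u64> = vec![50_000; 98];
    samples2.extend([5_000_000, 5_000_000]);
    assert_eq!(p99_ns(&mut samples2), 5_000_000);
}
```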
## Monitor checks
The [host-side monitor](../architecture/monitor.md) reads guest VM
memory (per-CPU runqueue structs via BTF offsets) and evaluates:
- **Imbalance ratio**: `max(nr_running) / max(1, min(nr_running))`
across CPUs. The denominator is clamped to 1 so an all-idle sample
does not divide by zero.
- **Local DSQ depth**: per-CPU dispatch queue depth.
- **Stall detection**: `rq_clock` not advancing on a CPU with
runnable tasks. Idle CPUs and preempted vCPUs are exempt. See
[Monitor: Stall detection](../architecture/monitor.md#stall-detection)
for exemption details.
- **Event rates**: scx fallback and keep-last event counters.
Monitor thresholds use a sustained sample window (default: 5 samples).
A violation must persist for N consecutive samples before failing.
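The imbalance formula and the sustained-window rule can be sketched together. Names are illustrative, not the monitor's real types; only the clamped-denominator formula and the N-consecutive-samples rule come from the text above.

```rust
// Sketch of the imbalance-ratio formula with the clamped denominator.
fn imbalance_ratio(nr_running: &[u32]) -> f64 {
    let max = *nr_running.iter().max().unwrap_or(&0);
    let min = *nr_running.iter().min().unwrap_or(&0);
    // Clamp so an all-idle sample (min == 0) does not divide by zero.
    max as f64 / min.max(1) as f64
}

/// A violation fails the test only after `window` consecutive
/// violating samples.
fn sustained(violations: &[bool], window: usize) -> bool {
    violations.windows(window).any(|w| w.iter().all(|&v| v))
}

fn main() {
    assert_eq!(imbalance_ratio(&[0, 0, 0, 0]), 0.0); // all idle, no panic
    assert_eq!(imbalance_ratio(&[8, 1, 2, 2]), 8.0);
    // Four violating samples in a row is not enough for window = 5.
    assert!(!sustained(&[true, true, true, true, false], 5));
    assert!(sustained(&[false, true, true, true, true, true], 5));
}
```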
## NUMA checks
When workers use a [`MemPolicy`](mem-policy.md), ktstr collects NUMA
page placement data and checks it against thresholds:
**Page locality** -- `assert_page_locality()` checks the fraction of
pages residing on the expected NUMA node(s). Expected nodes are derived
from the worker's `MemPolicy::node_set()` at evaluation time. Page
counts come from `WorkerReport::numa_pages` (parsed from
`/proc/self/numa_maps`). Returns 1.0 (vacuously local) when no pages
are observed. Fails if the observed fraction falls below
`min_page_locality`.
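The locality fraction is the share of observed pages on expected nodes. A sketch with an assumed per-node count shape (not the real `numa_pages` layout), including the vacuously-local empty case described above:

```rust
// Sketch: fraction of pages on expected NUMA nodes, given
// (node, page_count) pairs and the expected node set.
fn page_locality(pages_by_node: &[(u32, u64)], expected: &[u32]) -> f64 {
    let total: u64 = pages_by_node.iter().map(|&(_, n)| n).sum();
    if total == 0 {
        return 1.0; // vacuously local: no pages observed
    }
    let local: u64 = pages_by_node
        .iter()
        .filter(|(node, _)| expected.contains(node))
        .map(|&(_, n)| n)
        .sum();
    local as f64 / total as f64
}

fn main() {
    // 900 of 1000 pages on the expected node 0.
    let frac = page_locality(&[(0, 900), (1, 100)], &[0]);
    assert!((frac - 0.9).abs() < 1e-9);
    assert_eq!(page_locality(&[], &[0]), 1.0);
}
```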
**Cross-node migration** -- `assert_cross_node_migration()` checks
the ratio of migrated pages to total allocated pages.
`WorkerReport::vmstat_numa_pages_migrated` provides the delta of the
`numa_pages_migrated` counter from `/proc/vmstat` over the work loop.
Fails if the ratio exceeds `max_cross_node_migration_ratio`.
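The ratio itself is a simple quotient of the counter delta over total pages. A sketch; the zero-page guard is an assumption for illustration, not documented behavior:

```rust
// Sketch: migrated-to-allocated page ratio from the vmstat
// `numa_pages_migrated` delta over the work loop.
fn migration_ratio(numa_pages_migrated_delta: u64, total_pages: u64) -> f64 {
    if total_pages == 0 {
        return 0.0; // assumption: nothing allocated, nothing to flag
    }
    numa_pages_migrated_delta as f64 / total_pages as f64
}

fn main() {
    let ratio = migration_ratio(50, 1000);
    assert!((ratio - 0.05).abs() < 1e-9);
    // Would fail a hypothetical max_cross_node_migration_ratio of 0.02.
    assert!(ratio > 0.02);
}
```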
**Slow-tier ratio** -- `max_slow_tier_ratio` checks the fraction of
pages on memory-only NUMA nodes (CXL tiers). Fails if more than the
specified fraction of pages land on memory-only nodes.
None of these thresholds are set by default. Set via `Assert` setters
or `#[ktstr_test]` attributes.
## Assert struct
`Assert` is a composable configuration that carries both worker checks
and monitor thresholds:
```rust,ignore
pub struct Assert {
    // Worker checks
    pub not_starved: Option<bool>,
    pub isolation: Option<bool>,
    pub max_gap_ms: Option<u64>,
    pub max_spread_pct: Option<f64>,
    // Throughput checks
    pub max_throughput_cv: Option<f64>,
    pub min_work_rate: Option<f64>,
    // Benchmarking checks
    pub max_p99_wake_latency_ns: Option<u64>,
    pub max_wake_latency_cv: Option<f64>,
    pub min_iteration_rate: Option<f64>,
    pub max_migration_ratio: Option<f64>,
    // Monitor checks
    pub max_imbalance_ratio: Option<f64>,
    pub max_local_dsq_depth: Option<u32>,
    pub fail_on_stall: Option<bool>,
    pub sustained_samples: Option<usize>,
    pub max_fallback_rate: Option<f64>,
    pub max_keep_last_rate: Option<f64>,
    // NUMA checks
    pub min_page_locality: Option<f64>,
    pub max_cross_node_migration_ratio: Option<f64>,
    pub max_slow_tier_ratio: Option<f64>,
}
```
Every field is `Option`. `None` means "inherit from parent layer."
## Merge layers
Checking uses a three-layer merge:
1. `Assert::default_checks()` -- baseline: `not_starved` enabled,
monitor thresholds from `MonitorThresholds::DEFAULT`.
2. `Scheduler.assert` -- scheduler-level overrides.
3. Per-test `assert` -- test-specific overrides via `#[ktstr_test]`
attributes.
All fields use last-`Some`-wins semantics. A `Some(false)` in a
higher layer can disable a check that a lower layer enabled.
```rust,ignore
let final_assert = Assert::default_checks()
    .merge(&scheduler.assert)
    .merge(&test_assert);
```
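The per-field rule can be sketched with a two-field stand-in for `Assert`; the real struct has many more fields, but each one merges the same way:

```rust
// Minimal sketch of last-`Some`-wins merging. Field names match the
// documented Assert, but this is a stand-in, not ktstr's source.
struct Assert {
    not_starved: Option<bool>,
    max_gap_ms: Option<u64>,
}

impl Assert {
    fn merge(mut self, other: &Assert) -> Assert {
        // `other`'s Some values override; its None values inherit.
        self.not_starved = other.not_starved.or(self.not_starved);
        self.max_gap_ms = other.max_gap_ms.or(self.max_gap_ms);
        self
    }
}

fn main() {
    let base = Assert { not_starved: Some(true), max_gap_ms: Some(2000) };
    let test = Assert { not_starved: Some(false), max_gap_ms: None };
    let merged = base.merge(&test);
    // Some(false) in the higher layer disables the lower layer's check...
    assert_eq!(merged.not_starved, Some(false));
    // ...while None inherits the lower layer's threshold.
    assert_eq!(merged.max_gap_ms, Some(2000));
}
```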
## Default thresholds
### Worker checks
| Check | Release | Debug |
| --- | --- | --- |
| Scheduling gap | 2000 ms | 3000 ms |
| Fairness spread | 15% | 35% |
Debug builds run in small VMs with higher scheduling overhead, so
thresholds are relaxed. Coverage-instrumented builds collect profraw
data for code coverage analysis; all assertion and monitor threshold
checks run normally.
### Monitor checks
| Threshold | Default | Description |
| --- | --- | --- |
| `max_imbalance_ratio` | 4.0 | `max(nr_running) / max(1, min(nr_running))` across CPUs (denominator clamped to 1 so an all-idle sample does not divide by zero). Lower values (2-3) false-positive during cpuset transitions. |
| `max_local_dsq_depth` | 50 | Per-CPU dispatch queue overflow. Sustained depth above this means the scheduler is not consuming dispatched tasks. |
| `fail_on_stall` | true | Fail when `rq_clock` does not advance on a CPU with runnable tasks. Idle CPUs (NOHZ) and preempted vCPUs are exempt. |
| `sustained_samples` | 5 | At ~100ms sample interval, requires ~500ms of sustained violation. Filters transient spikes from cpuset reconfiguration. |
| `max_fallback_rate` | 200.0/s | `select_cpu_fallback` events per second across all CPUs. Sustained rate indicates systematic `select_cpu` failure. |
| `max_keep_last_rate` | 100.0/s | `dispatch_keep_last` events per second across all CPUs. Sustained rate indicates dispatch starvation. |
All monitor thresholds use the `sustained_samples` window -- a
violation must persist for N consecutive samples before failing.
## Worker checks via Assert
`Assert` provides `assert_cgroup()` for running worker-side checks
directly against collected reports:
```rust,ignore
let a = Assert::default_checks().max_gap_ms(5000);
let result = a.assert_cgroup(&reports, Some(&cpuset));
```
Use `Assert` for both the merge chain (`#[ktstr_test]` attributes,
`Scheduler.assert`, `execute_steps_with`) and direct report checking.
## Constants
- `Assert::NO_OVERRIDES` -- identity for `merge`; every field is `None`,
so it overrides nothing. This is not "no checks" -- when used as a
per-test or per-scheduler `assert`, the runtime chain still applies
defaults because it merges `default_checks() -> scheduler -> test`.
- `Assert::default_checks()` -- `not_starved` enabled, monitor
thresholds populated from `MonitorThresholds::DEFAULT`.
## AssertResult
`AssertResult` carries pass/fail status, diagnostic messages, and
aggregated statistics from a scenario run.
### Construction
- `AssertResult::pass()` -- creates a passing result with empty
details and default stats.
- `AssertResult::skip(reason)` -- creates a passing result with a
skip reason in `details` and `skipped = true`. Used when a
scenario cannot run under the current topology or flag
combination but is not a failure.
- `AssertResult::fail(detail)` -- failing result carrying a single
`AssertDetail`. Mirrors `pass` / `skip` for the failure axis.
- `AssertResult::fail_msg(msg)` -- shortcut for the common case
where the failure is a plain diagnostic message tagged
`DetailKind::Other`.
### Mutation and inspection
- `result.note(msg)` -- append an informational annotation tagged
`DetailKind::Note`. Does NOT flip `passed` or `skipped` -- a
note is context, not a verdict. Returns `&mut Self` so calls
chain.
- `result.with_note(msg)` -- builder-style sibling of `note` that
consumes and returns `self`. Use at the return site to chain a
context annotation onto a fresh result without an intermediate
`let mut`.
- `result.is_skipped()` -- convenience accessor returning
`skipped`. Stats tooling uses this to subtract non-executions
from pass counts.
- `result.is_failed()` -- convenience accessor returning
`!passed`. Mirrors `is_skipped` so branches reading "did this
claim fail?" don't negate `.passed` inline.
### Fields
- `passed: bool` -- whether all checks passed.
- `skipped: bool` -- distinguishes a passing result that ran every
check from one that skipped execution (topology / flag mismatch,
prerequisite absent). `AssertResult::skip` sets this; `pass` /
`fail` / `fail_msg` leave it `false`.
- `details: Vec<AssertDetail>` -- structured diagnostic entries; each
carries a `kind: DetailKind` (`Other`, `Note`, `Skip`, `Temporal`,
…) plus a human-readable `message: String`. Consumers filter by
`kind` for routing (failure vs informational note) and read
`message` for display.
- `stats: ScenarioStats` -- aggregated worker telemetry across all
cgroups (spread, gaps, migrations, wake latency, iterations).
- `measurements: BTreeMap<String, NoteValue>` -- structured
per-test measurements keyed by name. Sidecar consumers and
comparison tooling read this map directly without parsing
`details` strings, so populate it (via `Verdict::note_value`
during claim evaluation) for any value a downstream comparison
needs to lift programmatically.
### Merging
`result.merge(other)` combines two results. If `other.passed` is
false, the merged result is also false. Details and stats are
accumulated:
```rust,ignore
let mut combined = AssertResult::pass();
combined.merge(cgroup_0_result);
combined.merge(cgroup_1_result);
// combined.passed is false if either cgroup failed
// combined.details contains messages from both
```
Stats merging takes worst values across cgroups for spread, gap, wake
latency, and migration ratio. Counters (`total_workers`, `total_cpus`,
`total_migrations`, `total_iterations`) are summed.
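The worst-vs-sum rule can be sketched with a two-field stand-in; field names mirror the text, not ktstr's real `ScenarioStats`:

```rust
// Sketch of the stats-merge rule: worst value for quality metrics,
// sums for counters.
struct Stats {
    max_spread_pct: f64, // worst (max) across cgroups
    total_workers: u64,  // summed across cgroups
}

fn merge_stats(a: &Stats, b: &Stats) -> Stats {
    Stats {
        max_spread_pct: a.max_spread_pct.max(b.max_spread_pct),
        total_workers: a.total_workers + b.total_workers,
    }
}

fn main() {
    let cg0 = Stats { max_spread_pct: 12.0, total_workers: 4 };
    let cg1 = Stats { max_spread_pct: 28.0, total_workers: 4 };
    let merged = merge_stats(&cg0, &cg1);
    assert_eq!(merged.max_spread_pct, 28.0); // worst spread wins
    assert_eq!(merged.total_workers, 8);     // counters sum
}
```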
For examples of overriding thresholds at the scheduler and per-test
level, see [Customize Checking](../recipes/custom-checking.md).