ktstr 0.5.2

Test harness for Linux process schedulers
# Zero to ktstr

This tutorial walks through writing a complete `#[ktstr_test]` from
scratch. By the end you'll have a working scheduler test that runs
two cgroups with different lifecycle patterns across a multi-LLC
topology, tunes test duration and the watchdog, and asserts
fairness, throughput parity, and cpuset isolation.

## What you'll build

A test named `mixed_workloads` that:

- Runs **two cgroups** on **separate LLCs**:
  - `background_spinner` -- a persistent CPU-bound load that runs
    for the entire test duration.
  - `phased_worker` -- a worker that loops through explicit
    `Spin -> Yield -> Spin -> Yield ...` phases via
    `WorkType::Sequence`.
- Targets a **2-LLC, 4-core topology** so the scheduler has a real
  cache boundary to respect.
- Sets explicit **test duration** and **scx watchdog timeout** via
  `#[ktstr_test]` attributes.
- Asserts **fairness** (per-cgroup spread), **throughput parity**
  (CV across workers + minimum rate), and **cpuset isolation**
  (workers stay on their assigned CPUs). Scheduling gaps and
  host-side runqueue health are checked automatically.

The complete test is at the [end of this page](#the-complete-test).

## Prerequisites

Set up the host and a kernel before continuing:

- [Getting Started](getting-started.md) covers KVM access, the
  toolchain, and the dev-dependency.
- A bootable Linux kernel image is required. Build one with
  `cargo ktstr kernel build` or point at a source tree with
  `cargo ktstr test --kernel ../linux`. See
  [Getting Started: Build a kernel](getting-started.md#build-a-kernel)
  for the full kernel-management workflow.

Once the dependency is in place, create a file under your crate's
`tests/` directory (e.g. `tests/mixed_workloads.rs`) and follow along.

## Step 1: The skeleton

Every `#[ktstr_test]` is a Rust function that takes `&Ctx` and returns
`Result<AssertResult>`. Start with an empty body that passes
unconditionally:

```rust,ignore
use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}
```

`use ktstr::prelude::*;` brings in every type the test body needs --
`Ctx`, `AssertResult`, `CgroupDef`, `WorkType`, `CpusetSpec`,
`execute_defs`, and the `Result` alias from `anyhow`. The
`#[ktstr_test]` attribute registers the function so `cargo ktstr test`
discovers it and boots a VM with the requested topology.

A test without a `scheduler = ...` attribute runs under the kernel's
default EEVDF scheduler — useful as a baseline. Step 2 swaps in a
sched_ext scheduler so the rest of the tutorial exercises that
scheduler instead.

For the full attribute reference, see
[The #\[ktstr_test\] Macro](writing-tests/ktstr-test-macro.md).

## Step 2: Define your scheduler

To target a sched_ext scheduler, declare it with
`declare_scheduler!` and reference the generated const from
`#[ktstr_test(scheduler = …)]`. The example uses `scx-ktstr`,
the test-fixture scheduler shipped in this workspace; substitute
your own binary name to target a different scheduler.

```rust,ignore
use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
});

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}
```

`declare_scheduler!` emits a `pub static KTSTR_SCHED: Scheduler`
holding the declared fields and registers a private static in the
`KTSTR_SCHEDULERS` distributed slice via `linkme` so
`cargo ktstr verifier` discovers it automatically. The
`scheduler =` slot on `#[ktstr_test]` expects an `&'static Scheduler`
— pass the bare `KTSTR_SCHED` ident.

The macro fields:

- `name` — scheduler name for display and sidecar keys.
- `binary` — binary name for auto-discovery in
  `target/{debug,release}/`, the directory containing the test
  binary, or a `KTSTR_SCHEDULER` override path. When the scheduler
  is a `[[bin]]` target in the same workspace, `cargo build`
  already places it where discovery looks. The resolved binary is
  packed into the VM's initramfs.
- `topology = (numa, llcs, cores, threads)` — optional default VM
  topology. Tests can override individual dimensions via
  `#[ktstr_test(llcs = ...)]`. Omitted here; the per-test
  attributes in Step 4 set every dimension explicitly.
- `sched_args = ["--flag", "--another"]` — optional CLI args
  prepended to every test that uses this scheduler. Useful when a
  scheduler needs the same `--enable-llc`-style switches in every
  run; for one-off variations, use `#[ktstr_test(extra_sched_args = [...])]`
  on the test instead.
- `kernels = ["6.14", "6.15..=7.0"]` — optional set of kernel
  specs the verifier sweep should exercise this scheduler against.
  See [BPF Verifier](running-tests/verifier.md) for the cell
  emission contract.
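
Putting the optional fields together -- the field syntax follows the
list above; the specific values (topology shape, `--enable-llc`,
kernel specs) are illustrative, not requirements of `scx-ktstr`:

```rust,ignore
declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
    topology = (1, 2, 2, 1),           // numa, llcs, cores, threads
    sched_args = ["--enable-llc"],     // prepended to every run
    kernels = ["6.14", "6.15..=7.0"],  // verifier sweep kernel specs
});
```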

For the full attribute surface (`sysctls`, `kargs`, `config_file`,
gauntlet constraints, scheduler-level assertion overrides), see
[Scheduler Definitions](writing-tests/scheduler-definitions.md).

When the macro doesn't fit — the most common case being inline
JSON config supplied per-test or programmatic composition — define
the `Scheduler` const through the manual builder instead. Step 12
below walks through that path with `scx_layered`.

## Step 3: Add workloads

A `CgroupDef` declares a cgroup along with the workers that will run
inside it. The builder methods configure worker count, the work each
worker performs, scheduling policy, and cpuset assignment.

Add two cgroups -- both running tight CPU spinners for now. Step 5
will swap one of them for a phased workload:

```rust,ignore
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait),
    ])
}
```

Without `.with_cpuset(...)`, a cgroup's workers run on every CPU
in the test's topology — they share the VM's full CPU set
with all other cgroups. `.with_cpuset(CpusetSpec::Llc(idx))`
(introduced in Step 4) restricts a cgroup to one LLC's CPUs, and
the other `CpusetSpec` variants narrow further.

`WorkType::SpinWait` runs a tight CPU spin loop; it is one of many
work primitives -- see [WorkType](concepts/work-types.md) for the
full enum (`Bursty`, `FutexPingPong`, `CachePressure`,
`IoSyncWrite`, `PageFaultChurn`, `MutexContention`, `Sequence`, etc.)
and the work-type-to-scheduler-behavior mapping table.

`execute_defs` is a convenience wrapper that runs each cgroup
concurrently for the test's full duration. Both cgroups are
**persistent** -- they hold for the entire scenario. Use
`execute_steps` when you need to add cgroups mid-run or swap
cpusets between phases; see [Ops and Steps](concepts/ops.md) for
the multi-step API.

## Step 4: Set topology

The `#[ktstr_test]` attribute carries the VM's CPU topology.
Topology dimensions are big-to-little: `numa_nodes` (default 1),
`llcs` (total across all NUMA nodes), `cores` per LLC, and
`threads` per core. Total CPU count is `llcs * cores * threads`.

LLC count matters because the last-level cache is the primary
scheduling boundary -- tasks sharing an LLC benefit from shared
cache lines, while cross-LLC migration carries a cold-cache penalty.
A scheduler that ignores LLC topology will look fine on `llcs = 1`
and start failing as soon as there is a real cache boundary to
respect.
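
For the topology this tutorial targets, the arithmetic works out as
follows. The LLC-major CPU numbering shown is an assumption for
illustration only -- use the `TestTopology` accessors for the real
mapping:

```rust
fn main() {
    // Tutorial topology: llcs = 2, cores = 2, threads = 1.
    let (llcs, cores, threads) = (2usize, 2, 1);
    let total_cpus = llcs * cores * threads;
    assert_eq!(total_cpus, 4);

    // Assuming LLC-major numbering, each LLC owns a contiguous range:
    let per_llc = cores * threads;
    for llc in 0..llcs {
        let first = llc * per_llc;
        println!("LLC {llc}: cpus {first}..={}", first + per_llc - 1);
    }
}
```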

Bump the topology to two LLCs with two cores each (4 CPUs total) so
each cgroup can own its own LLC:

```rust,ignore
#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .with_cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .with_cpuset(CpusetSpec::Llc(1)),
    ])
}
```

`CpusetSpec::Llc(idx)` confines a cgroup to the CPUs that belong to
LLC `idx`. Other variants (`Numa`, `Range`, `Disjoint`, `Overlap`,
`Exact`) cover NUMA-node binding, fractional partitioning, and
hand-built CPU sets.

For the full topology surface (NUMA accessors, per-LLC info,
cpuset generation helpers), see [TestTopology](concepts/topology.md).

## Step 5: Compose phased work inside a cgroup

So far both cgroups run identical CPU spinners. The point of this
test is to exercise a scheduler against **different lifecycle
patterns** at once, so swap `phased_worker` for a worker that loops
through explicit phases.

`WorkType::Sequence { first: Phase, rest: Vec<Phase> }` runs each
phase for its specified duration and then advances to the next; when
the last phase ends the loop restarts from `first`. Phases:
`Phase::Spin(Duration)`, `Phase::Sleep(Duration)`,
`Phase::Yield(Duration)`, `Phase::Io(Duration)`. Use the
`WorkType::sequence(first, rest)` constructor.

`Phase`, `WorkType`, and `CpusetSpec` are all in
`ktstr::prelude::*`; only `std::time::Duration` needs an extra
`use` line — added on the first line of the example below:

```rust,ignore
use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        // Persistent CPU pressure on LLC 0 for the whole run.
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .with_cpuset(CpusetSpec::Llc(0)),
        // Phased worker on LLC 1: spin 100 ms, yield for 20 ms,
        // then loop. Stresses the scheduler's wake-after-yield
        // placement repeatedly while the LLC-0 spinner keeps
        // host runqueue pressure constant.
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                Phase::Spin(Duration::from_millis(100)),
                [Phase::Yield(Duration::from_millis(20))],
            ))
            .with_cpuset(CpusetSpec::Llc(1)),
    ])
}
```

The two cgroups now exercise distinct paths concurrently:

- `background_spinner` keeps two CPUs continuously busy on LLC 0.
- `phased_worker` alternates between burning CPU and yielding on
  LLC 1, exercising the scheduler's voluntary-preemption + wakeup
  placement code paths.

Both cgroups still run for the **entire scenario duration**: the
phasing happens *within* each `phased_worker` worker's loop, while
`execute_defs` holds both cgroups across the whole run via
`HoldSpec::FULL`. To express phasing across cgroups (e.g. add
`phased_worker` only for the second half of the run), use
`execute_steps` with multiple `Step` entries -- see
[Ops and Steps](concepts/ops.md). Step 9 below adds an `Op::snapshot`
capture into a step's op list.

## Step 6: Tune execution

Several `#[ktstr_test]` attributes control how the VM runs the
scenario. The defaults favor fast iteration; raise them for longer
or heavier runs:

| Attribute | Default | What it does |
|---|---|---|
| `duration_s` | `12` | Per-scenario wall-clock seconds. The framework keeps both cgroups running for `duration_s` seconds, then signals workers to stop and collects reports. |
| `watchdog_timeout_s` | `5` | sched_ext watchdog fire threshold. Applied via `scx_sched.watchdog_timeout` on 7.1+ kernels and the static `scx_watchdog_timeout` symbol on pre-7.1 kernels. When neither path is available the override silently no-ops. |
| `memory_mb` | `2048` | VM memory in MiB. |

`watchdog_timeout_s` is sched_ext's per-task stall threshold — if
a runnable task is not picked for `watchdog_timeout_s` seconds,
the scheduler exits with `SCX_EXIT_ERROR_STALL`. The scenario
duration and watchdog are independent; a 12 s scenario with a 5 s
watchdog is normal. Tune the watchdog only when the scheduler
under test is expected to legitimately leave a runnable task
parked longer than the default 5 s.

For the run we're building, set the duration to 20 s (so each
phase iteration repeats many times):

```rust,ignore
#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    // body unchanged from Step 5 -- two cgroups via execute_defs
}
```

For the full attribute reference (auto-repro, performance mode,
topology constraints, etc.), see
[The #\[ktstr_test\] Macro](writing-tests/ktstr-test-macro.md).

## Step 7: Add assertions

Default checks already run with no configuration -- `not_starved` is
`Some(true)` in `Assert::default_checks()`, which enables:

- **Starvation** -- any worker with zero work units fails the test.
- **Fairness spread** -- per-cgroup `max(off-CPU%) - min(off-CPU%)`
  must stay under the spread threshold (release default 15%; debug
  default 35% — debug builds in small VMs show higher spread, so
  the threshold loosens automatically when `cfg!(debug_assertions)`
  is true).
- **Scheduling gaps** -- the longest wall-clock gap observed at
  work-unit checkpoints must stay under the gap threshold
  (release default 2000 ms; debug default 3000 ms — same
  `cfg!(debug_assertions)` gate as spread).
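
The spread computation itself is simple enough to sketch directly.
The per-worker off-CPU percentages below are hypothetical (chosen to
mirror the `spread=22% (10-32%)` failure shown in Step 8):

```rust
fn main() {
    // Fairness spread: max(off-CPU %) - min(off-CPU %) within one cgroup.
    let off_cpu_pct = [10.0_f64, 32.0, 14.0, 12.0]; // hypothetical workers
    let max = off_cpu_pct.iter().cloned().fold(f64::MIN, f64::max);
    let min = off_cpu_pct.iter().cloned().fold(f64::MAX, f64::min);
    let spread = max - min;
    println!("spread = {spread}%"); // 22% -- over the 15% release default
    assert!(spread > 15.0);
}
```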

Host-side monitor checks (imbalance ratio, DSQ depth, stall
detection, fallback / keep-last event rates) are also enabled by
default with thresholds from `MonitorThresholds::DEFAULT`.

Cpuset isolation is **opt-in** -- enable it with `isolation = true`.
Override the spread threshold and add throughput-parity gates:

```rust,ignore
use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    isolation = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .with_cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                Phase::Spin(Duration::from_millis(100)),
                [Phase::Yield(Duration::from_millis(20))],
            ))
            .with_cpuset(CpusetSpec::Llc(1)),
    ])
}
```

What each new attribute gates:

- `isolation = true` -- workers must only run on CPUs in their
  assigned cpuset; any execution on an unexpected CPU fails the test.
- `max_spread_pct = 20.0` -- per-cgroup fairness override (the
  release default is 15.0; this loosens it slightly to absorb noise
  from the phased worker's yield-driven re-placement). Bare
  `max_spread_pct = 15.0` would silently match the default and have
  no observable effect.
- `max_throughput_cv = 0.5` -- coefficient of variation of
  `work_units / cpu_time` across workers. Catches a scheduler that
  gives some workers disproportionately less effective CPU.
- `min_work_rate = 1.0` -- minimum work units per CPU-second per
  worker. Catches the case where every worker is equally slow
  (CV passes but absolute throughput is too low).
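
The two gates compose: the CV catches skew across workers, the floor
catches uniform slowness. A sketch with hypothetical per-worker rates
(population stddev; whether the framework uses population or sample
variance is not specified here):

```rust
fn main() {
    // Throughput parity: CV (stddev / mean) of work units per
    // CPU-second across workers, plus an absolute per-worker floor.
    let rates = [50.0_f64, 48.0, 52.0, 20.0]; // hypothetical; one laggard
    let mean = rates.iter().sum::<f64>() / rates.len() as f64;
    let var = rates.iter().map(|r| (r - mean).powi(2)).sum::<f64>()
        / rates.len() as f64;
    let cv = var.sqrt() / mean;
    let min_rate = rates.iter().cloned().fold(f64::MAX, f64::min);
    assert!(cv < 0.5);        // passes max_throughput_cv = 0.5 (cv ~ 0.31)
    assert!(min_rate >= 1.0); // passes min_work_rate = 1.0
    println!("cv = {cv:.2}, min_rate = {min_rate}");
}
```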

`#[ktstr_test]` exposes the full `Assert` surface (scheduling gaps,
monitor thresholds, NUMA locality, wake-latency benchmarks). See
[Checking](concepts/checking.md) for the merge chain
(`default_checks() -> Scheduler.assert -> per-test`) and the
complete threshold list.

## Step 8: Run it

Run the test with `cargo ktstr test`, scoped to this one test name:

```sh
cargo ktstr test --kernel ../linux -- -E 'test(mixed_workloads)'
```

If `cargo ktstr test` reports "kernel not found", the `--kernel` path
either points at a directory without a built vmlinux or at a kernel
the cache cannot locate. Run `cargo ktstr kernel build` to populate
the cache, or pass an explicit path to a built kernel source tree —
see [Getting Started: Build a kernel](getting-started.md#build-a-kernel)
for the resolution order.

If a probe-related error surfaces ("probe skeleton load failed",
"trigger attach failed"), re-run with `RUST_LOG=ktstr=debug` to
see the underlying libbpf reason. Common causes: missing tp_btf
target on older kernels (handled automatically by the two-phase
fallback), `CONFIG_DEBUG_INFO_BTF=n` in the guest kernel (rebuild
with BTF enabled), or a verifier rejection on a non-optional
program (the retry surfaces both the original and retry errors so
the verifier output is preserved).

`cargo ktstr test` resolves the kernel image, boots a VM with the
declared topology, runs the test as the guest's init, and reports
the result. A passing run looks like:

```text
    PASS [  11.34s] my_crate::mixed_workloads ktstr/mixed_workloads
```

A failure prints the violated threshold along with per-cgroup stats:

```text
    FAIL [  12.05s] my_crate::mixed_workloads ktstr/mixed_workloads

--- STDERR ---
ktstr_test 'mixed_workloads' [sched=scx-ktstr] [topo=1n2l2c1t] failed:
  unfair cgroup: spread=22% (10-32%) 2 workers on 2 cpus (threshold 20%)

--- stats ---
4 workers, 4 cpus, 12 migrations, worst_spread=22.4%, worst_gap=180ms
  cg0: workers=2 cpus=2 spread=22.4% gap=180ms migrations=8 iter=15230
  cg1: workers=2 cpus=2 spread=4.1% gap=120ms migrations=4 iter=14870
```

The detail line `unfair cgroup: spread=N% (min-max%) N workers on
N cpus (threshold N%)` is the exact format produced by
`assert::assert_not_starved`. Other detail-line shapes the
same producer emits:

- `tid {N} starved (0 work units)` — when a worker made no
  progress. Example:

  ```text
  ktstr_test 'mixed_workloads' [topo=1n2l2c1t] failed:
    tid 2 starved (0 work units)
  ```

- `tid {N} stuck {N}ms on cpu{N} at +{N}ms (threshold {N}ms)` — when a worker's longest off-CPU gap crossed
  `Assert::max_gap_ms`. Example:

  ```text
  ktstr_test 'mixed_workloads' [topo=1n2l2c1t] failed:
    tid 7 stuck 2500ms on cpu3 at +4200ms (threshold 2000ms)
  ```

- `unfair cgroup: spread={pct}% ({lo}-{hi}%)` — when per-cgroup
  fairness exceeded `max_spread_pct`. Example:

  ```text
  ktstr_test 'mixed_workloads' [topo=1n2l2c1t] failed:
    unfair cgroup: spread=22% (10-32%) 2 workers on 2 cpus (threshold 20%)
  ```

The reporting layer does NOT include the cgroup name — `cg{i}`
is the positional index in the stats roll-up (`cg0`, `cg1`, ...)
matching the `cg{i}: workers=... cpus=... spread=...` per-cgroup
stats line emitted by `test_support::eval::evaluate_vm_result`.

For the full run lifecycle, sidecar layout, and analysis workflow,
see [Running Tests](running-tests.md).

## Step 9: Capture a snapshot

Threshold-based assertions tell you something is off; snapshots tell
you *what* the scheduler's state actually was. `Op::snapshot(name)`
asks the host to freeze every vCPU long enough to read the BPF
(in-kernel program) map state, vCPU registers, and per-CPU counters
into a `FailureDumpReport` keyed by `name`, then resumes the guest.

`execute_defs` (used so far) takes a flat list of cgroups and runs
them concurrently. To inject a snapshot mid-run, switch to
`execute_steps`, which takes a list of `Step`s — each step has
`setup` cgroups, an `ops` list (where `Op::snapshot(...)` lives),
and a `hold` duration:

```rust,ignore
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1, duration_s = 20)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_steps(ctx, vec![
        Step {
            setup: Setup::Defs(vec![
                CgroupDef::named("background_spinner")
                    .workers(2)
                    .work_type(WorkType::SpinWait)
                    .with_cpuset(CpusetSpec::Llc(0)),
                CgroupDef::named("phased_worker")
                    .workers(2)
                    .work_type(WorkType::SpinWait)
                    .with_cpuset(CpusetSpec::Llc(1)),
            ]),
            ops: vec![Op::snapshot("after_workload")],
            hold: HoldSpec::FULL,
        },
    ])
}
```

After the scenario completes, the captured report is keyed by name
on the active `SnapshotBridge` — the host-side store that owns the
captured `FailureDumpReport` map for the run. Downstream test code
drains it and walks scalar variables with the dotted-path accessor —
e.g. `snap.var("nr_cpus_onln").as_u64()?` reads a scheduler global
(any `.bss`/`.data`/`.rodata` symbol; `Snapshot::var` walks all
three) as a `u64`.

For the bridge wiring, the full traversal API
(`Snapshot::map`, `SnapshotEntry::get`, per-CPU narrowing,
error variants), and the symbol-driven
[`Op::watch_snapshot`](writing-tests/watch-snapshots.md) variant
that fires whenever the guest writes a kernel symbol, see
[Snapshots](writing-tests/snapshots.md).

## Step 10: Gauntlet expansion

The `#[ktstr_test]` macro doesn't just emit a single test -- it
also generates a **gauntlet** of variants that run the same body
across every accepted topology preset (single-LLC, multi-LLC,
multi-NUMA, with/without SMT).

Gauntlet variants are nextest-discovered and run with
`cargo ktstr test -- --run-ignored ignored-only -E 'test(gauntlet/)'`.
Constrain coverage with `min_llcs` / `max_llcs`, `min_cpus` /
`max_cpus`, and `requires_smt` on the attribute. See
[Gauntlet Tests](writing-tests/gauntlet-tests.md) for the full
filtering and worked examples.

## Step 11: Name and prioritize workers

Per-cgroup defaults travel through `CgroupDef`'s builder methods so
schedulers that key on `task->comm` or `task_struct->static_prio`
can be exercised with realistic, distinguishable workers:

```rust,ignore
CgroupDef::named("background_spinner")
    .workers(2)
    .comm("bg_spinner")           // prctl(PR_SET_NAME, "bg_spinner")
    .nice(10)                     // setpriority(PRIO_PROCESS, 0, 10)
    .work_type(WorkType::SpinWait)
```

- **`.comm("name")`** — every worker calls `prctl(PR_SET_NAME, name)`
  at startup. The kernel truncates `task->comm` to 15 bytes inside
  `__set_task_comm`. Distinguishes workers in `top` / `ps` output
  and in scheduler tracepoints.
- **`.nice(n)`** — every worker calls
  `setpriority(PRIO_PROCESS, 0, n)`. Values below the calling
  task's current nice require `CAP_SYS_NICE`; ktstr always runs as
  root in-VM so the full `-20..=19` range is available. Skips the
  syscall entirely when `.nice(...)` is not chained (workers
  inherit the parent's nice).
- **`.pcomm("name")`** — set the *thread-group leader*'s
  `task->comm`. Triggers ktstr's fork-then-thread spawn path:
  workers sharing a `pcomm` value coalesce into ONE forked leader
  process whose `task->group_leader->comm` is `name`, with worker
  threads inside it. Models real applications like `chrome` (pcomm)
  hosting `ThreadPoolForeg` (per-thread comm) and `java` (pcomm)
  hosting `GC Thread` / `C2 CompilerThre`.

`pcomm` is a `WorkSpec` field, NOT a `CloneMode` variant. The two
real `CloneMode` variants are `Fork` (default; each worker is its
own thread group) and `Thread` (workers share the harness's tgid as
`std::thread::spawn` threads). `pcomm` triggers an in-process
fork-then-thread shape that combines per-process leader visibility
schedulers expect with the in-process thread-spawn dispatch the
worker bodies use. `PipeIo` and `CachePipe` workers placed in a
`.pcomm(...)` cgroup run as threads inside the pcomm container;
their pipe-pair partner indices are computed within the
container's thread group, not across forked siblings.
`SignalStorm` uses `tkill` (per-task signal delivery,
`PIDTYPE_PID`) rather than `kill` (`PIDTYPE_TGID`), so the
partner-vs-self addressing is correct uniformly across `Fork` and
`Thread` modes — including inside pcomm-coalesced thread groups.

Per-`WorkSpec` overrides win over cgroup-level defaults — write
`.work(WorkSpec::default().nice(0).comm("hot_spinner"))` to opt a
specific worker out of the cgroup-level defaults.
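
As a hedged sketch composed only from the builder methods above (the
cgroup, `pcomm`, and `comm` values are illustrative, modeled on the
browser example), a pcomm-coalesced cgroup would look like:

```rust,ignore
// Four workers coalesced into ONE forked leader process whose
// task->group_leader->comm is "chrome", each worker thread named
// "ThreadPoolForeg" via prctl(PR_SET_NAME, ...).
CgroupDef::named("browser_like")
    .workers(4)
    .pcomm("chrome")            // thread-group leader's comm
    .comm("ThreadPoolForeg")    // per-thread comm
    .work_type(WorkType::SpinWait)
```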

## Step 12: Inline scheduler config

Schedulers like `scx_layered` and `scx_lavd` accept a JSON config via
`--config /path/to/file.json`. Declare the arg template + guest path
on a `Scheduler` const built via the manual builder, then supply the
inline content from the test attribute:

```rust,ignore
const LAYERED_SCHED: Scheduler = Scheduler::new("layered")
    .binary(SchedulerSpec::Discover("scx_layered"))
    .topology(1, 2, 4, 1)
    .config_file_def("--config {file}", "/include-files/layered.json");

const LAYERED_CONFIG: &str = r#"{ "layers": [{ "name": "default" }] }"#;

#[ktstr_test(scheduler = LAYERED_SCHED, config = LAYERED_CONFIG)]
fn layered_default(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}
```

The framework writes `LAYERED_CONFIG` to the guest at the path
declared on the scheduler (`/include-files/layered.json`) and
substitutes `{file}` in the arg template with that path before
launching the scheduler binary. A scheduler that declares
`config_file_def` REQUIRES every test to supply `config = …`
(compile-time + runtime gate); a scheduler that doesn't declare it
REJECTS `config = …` (the content would be silently dropped). See
[The #\[ktstr_test\] Macro](writing-tests/ktstr-test-macro.md#inline-scheduler-config)
for the full pairing rules.

For schedulers whose config lives on disk on the host (no inline
content), use `Scheduler::config_file(host_path)` instead — the
host file is packed into the initramfs and `--config` is injected
into scheduler args automatically; no `config = …` on the test is
needed in that flavor.

## Step 13: Decouple virtual topology from host hardware

By default, ktstr pins vCPUs to host cores in a layout that mirrors
the declared virtual topology. A test declaring `numa_nodes = 2,
llcs = 8` cannot run on a 1-NUMA-node host — the gauntlet preset
filter rejects it. Set `no_perf_mode = true` to drop the host
mirroring and run the declared virtual topology unchanged:

```rust,ignore
#[ktstr_test(
    numa_nodes = 2,
    llcs = 8,             // 8 % 2 == 0; the macro requires divisibility
    cores = 4,
    no_perf_mode = true,  // VM built as declared, even on 1-NUMA hosts
)]
fn two_node_test(ctx: &Ctx) -> Result<AssertResult> { /* ... */ }
```

In `no_perf_mode`:
- The VM's virtual topology is built as declared via KVM vCPU
  layout, ACPI SRAT/SLIT (x86_64), or FDT cpu nodes (aarch64) —
  the guest sees the full requested NUMA / LLC structure.
- vCPU-to-host-core pinning, 2 MB hugepages, NUMA mbind, RT
  scheduling, and KVM exit suppression are skipped.
- Host topology constraints (`min_numa_nodes`, `min_llcs`,
  `requires_smt`, per-LLC CPU widths) are NOT compared against
  host hardware. The only host check that survives is "total host
  CPUs >= total vCPUs".

`no_perf_mode = true` is mutually exclusive with `performance_mode
= true` (`KtstrTestEntry::validate` rejects the combination at
runtime). Equivalent to setting `KTSTR_NO_PERF_MODE=1` per-test —
either source forces the no-perf path. See
[Performance Mode](concepts/performance-mode.md#tier-2-no-perf-mode-with-cpu-cap-reservation)
for the full lifecycle.

## Step 14: Periodic capture and temporal assertions

On-demand `Op::snapshot` (Step 9) captures the scheduler's BPF state
at a point you choose. **Periodic capture** fires automatically at
evenly-spaced points across the workload window, producing a
time-ordered `SampleSeries` (the host-side container of drained
snapshots, in capture order; `.periodic_only()` filters to
periodic-tagged samples) for temporal assertions. Periodic capture
is only useful when paired with a `post_vm` callback that drains
the bridge and asserts something about the sequence — the two
attributes belong together.

Enable periodic capture with `num_snapshots = N` and register the
host-side callback with `post_vm = function_name`. The callback
drains the bridge and runs assertions over the time-ordered series:

```rust,ignore
use ktstr::prelude::*;

fn check_dispatch_advances(result: &VmResult) -> Result<()> {
    let series = SampleSeries::from_drained(
        result.snapshot_bridge.drain_ordered_with_stats(),
    )
    .periodic_only();

    let mut v = Verdict::new();

    let nr_dispatched: SeriesField<u64> = series.bpf(
        "nr_dispatched",
        |snap| snap.var("nr_dispatched").as_u64(),
    );
    nr_dispatched.nondecreasing(&mut v);

    let r = v.into_result();
    anyhow::ensure!(r.passed, "temporal assertions failed: {:?}", r.details);
    Ok(())
}

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    num_snapshots = 5,
    post_vm = check_dispatch_advances,
)]
fn dispatch_advances(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
    ])
}
```

`num_snapshots = 5` fires 5 freeze-and-capture boundaries inside the
10%-90% window of the 20 s workload — at roughly +5 s, +7 s, +10 s,
+13 s, +15 s. Each capture freezes every vCPU, reads BPF map state,
and resumes. The host watchdog deadline is extended by each freeze
duration so captures do not eat into the workload budget. The
captures are stored under `periodic_000`…`periodic_004` on the
`SnapshotBridge`.
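The boundary times quoted above are consistent with N captures subdividing the 10%-90% window into N + 1 equal intervals. That formula is an inference from the quoted numbers, not ktstr's documented spacing rule (see the Periodic Capture page for the authoritative one), but it reproduces them exactly:

```rust
// Assumed spacing rule: N boundaries split the [10%, 90%] window
// into N + 1 equal intervals. Inferred, not ktstr's actual code.
fn capture_boundaries_s(duration_s: f64, num_snapshots: usize) -> Vec<f64> {
    let start = 0.10 * duration_s;
    let end = 0.90 * duration_s;
    (1..=num_snapshots)
        .map(|i| start + (end - start) * i as f64 / (num_snapshots + 1) as f64)
        .collect()
}

fn main() {
    // duration_s = 20, num_snapshots = 5 -> 4.67, 7.33, 10, 12.67, 15.33 s,
    // i.e. roughly +5 s, +7 s, +10 s, +13 s, +15 s as quoted above.
    let rounded: Vec<i64> = capture_boundaries_s(20.0, 5)
        .iter()
        .map(|t| t.round() as i64)
        .collect();
    assert_eq!(rounded, vec![5, 7, 10, 13, 15]);
    println!("{rounded:?}");
}
```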

`Verdict` is the assertion accumulator: every pattern call records
its outcome on the same `Verdict`, and `v.into_result()` consumes it
into a pass/fail `AssertResult`.

The seven temporal patterns on `SeriesField`:

| Pattern | Type | What it checks |
|---|---|---|
| `nondecreasing` | u64/f64 | Every consecutive pair: `v[i] <= v[i+1]` |
| `strictly_increasing` | u64/f64 | Every consecutive pair: `v[i] < v[i+1]` |
| `rate_within(lo, hi)` | f64 | Per-pair `delta_value / delta_ms` in `[lo, hi]` |
| `steady_within(warmup_ms, tol)` | f64 | Post-warmup values within `mean ± tol%` |
| `converges_to(target, tol, deadline_ms)` | f64 | 3 consecutive samples in `[target ± tol]` before deadline |
| `ratio_within(other, lo, hi)` | f64 | Per-sample `self / other` in `[lo, hi]` (cross-field) |
| `always_true` | bool | Every sample is `true` |

Every pattern method takes `&mut Verdict` as its first argument and
returns it, so calls chain into the same accumulator.

`SeriesField::each` provides per-sample scalar bounds:
`field.each(&mut v).at_least(1u64)`,
`field.each(&mut v).between(0.0, 100.0)`.
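The accumulator pattern itself is easy to see in miniature. The sketch below is a toy, not ktstr's `Verdict`/`SeriesField` API: a pass flag plus a details list, with a standalone `nondecreasing` check over `(tag, elapsed_ms, value)` samples that records a failure detail in the shape shown below:

```rust
// Toy accumulator illustrating the pattern, not ktstr's API.
struct Verdict {
    passed: bool,
    details: Vec<String>,
}

impl Verdict {
    fn new() -> Self {
        Verdict { passed: true, details: Vec::new() }
    }
    fn fail(&mut self, msg: String) {
        self.passed = false;
        self.details.push(msg);
    }
}

// Samples as (tag, elapsed_ms, value); check every consecutive pair.
fn nondecreasing(samples: &[(&str, u64, u64)], v: &mut Verdict) {
    for w in samples.windows(2) {
        let (ptag, pms, pval) = w[0];
        let (tag, ms, val) = w[1];
        if val < pval {
            v.fail(format!(
                "nondecreasing: regression at sample {tag} (+{ms}ms): \
                 value {val} after prior value {pval} at sample {ptag} (+{pms}ms)"
            ));
        }
    }
}

fn main() {
    let samples = [
        ("periodic_000", 5000, 40),
        ("periodic_001", 7000, 42),
        ("periodic_002", 10000, 41), // regression: 41 < 42
    ];
    let mut v = Verdict::new();
    nondecreasing(&samples, &mut v);
    assert!(!v.passed);
    println!("{}", v.details[0]);
}
```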

When a temporal pattern fails, the `AssertDetail` entries
identify the offending sample by tag and elapsed-ms timestamp.
Example for `nondecreasing` flagging a regression on
`nr_dispatched`:

```text
nr_dispatched (nondecreasing): regression at sample periodic_002 (+10000ms): \
value 41 after prior value 42 at sample periodic_001 (+7000ms)
```

The rate, steady, converges, ratio, and always-true variants emit
parallel shapes — every detail names the pattern, the specific
sample(s) involved, and the violating value, so a failing test
points at the data without re-running.

For boundary timing, spacing rules, and the bridge cap, see
[Periodic Capture](writing-tests/periodic-capture.md). For the full
projection API (`bpf`, `stats`, auto-projectors) and failure
rendering, see
[Temporal Assertions](writing-tests/temporal-assertions.md).

## Step 15: After the run — test statistics

`cargo ktstr stats` aggregates the sidecar JSON files that each test
variant writes — useful for tracking gauntlet coverage, BPF verifier
complexity, and scheduling behavior across commits. This is a
post-run CLI workflow, not part of the test definition:

```sh
cargo ktstr stats                                 # summary: gauntlet coverage, verifier, KVM stats
cargo ktstr stats list                            # list runs with date, test count, arch
cargo ktstr stats compare --a-kernel 6.14 \       # diff sidecar partitions defined by
    --b-kernel 6.15                               #   per-side --a-X / --b-X filter flags
```

Statistics are collected even on test failure (`if: !cancelled()` in
CI). For the full subcommand surface, see
[cargo-ktstr stats](running-tests/cargo-ktstr.md#stats).

## The complete test

The shape exercised by every step above, in one file.
`sched_args = ["--slow"]` unconditionally applies scx-ktstr's
`--slow` mode (Step 2); `watchdog_timeout_s = 10` overrides the
sched_ext stall threshold (Step 6); `num_snapshots` + `post_vm`
enable periodic capture and a temporal assertion (Step 14):

```rust,ignore
use std::time::Duration;
use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
    sched_args = ["--slow"],
});

fn check_dispatch_advances(result: &VmResult) -> Result<()> {
    let series = SampleSeries::from_drained(
        result.snapshot_bridge.drain_ordered_with_stats(),
    )
    .periodic_only();

    let mut v = Verdict::new();

    let nr_dispatched: SeriesField<u64> = series.bpf(
        "nr_dispatched",
        |snap| snap.var("nr_dispatched").as_u64(),
    );
    nr_dispatched.nondecreasing(&mut v);

    let r = v.into_result();
    anyhow::ensure!(r.passed, "temporal assertions failed: {:?}", r.details);
    Ok(())
}

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    watchdog_timeout_s = 10,
    isolation = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
    num_snapshots = 5,
    post_vm = check_dispatch_advances,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .with_cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                Phase::Spin(Duration::from_millis(100)),
                [Phase::Yield(Duration::from_millis(20))],
            ))
            .with_cpuset(CpusetSpec::Llc(1)),
    ])
}
```

Run it:

```sh
cargo ktstr test --kernel ../linux -- -E 'test(mixed_workloads)'
```

## What you'll see when things break

The output examples below are the shapes ktstr emits in real
runs. They're worth skimming before you ship a test so a future
failure is recognizable.

### Auto-repro probe chain

When the scheduler crashes, ktstr re-runs the scenario with BPF
probes attached and dumps the path leading to the exit. Decoded
struct fields appear inline, with `→` between fentry-captured
entry values and fexit-captured exit values:

```text
ktstr_test 'demo_host_crash_auto_repro' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
  scheduler died

--- auto-repro ---
=== AUTO-PROBE: scx_exit fired ===

  ktstr_enqueue                                                   main.bpf.c:21
    task_struct *p
      pid         97
      cpus_ptr    0xf(0-3)
      dsq_id      SCX_DSQ_INVALID
      enq_flags   NONE
      slice       0
      vtime       0
      weight      100
      sticky_cpu  -1
      scx_flags   QUEUED|ENABLED
  do_enqueue_task                                               kernel/sched/ext.c:1344
    rq *rq
      cpu         1
    task_struct *p
      pid         97
      cpus_ptr    0xf(0-3)
      dsq_id      SCX_DSQ_INVALID          →  SCX_DSQ_LOCAL
      enq_flags   NONE
      slice       20000000
      vtime       0
      weight      100
      sticky_cpu  -1
      scx_flags   QUEUED|DEQD_FOR_SLEEP    →  QUEUED
```

For the probe pipeline architecture, the BTF resolution path,
event-stitching rules, and the `demo_host_crash_auto_repro`
fixture, see [Auto-Repro](running-tests/auto-repro.md).

### Failure dumps with cast-recovered pointers

The freeze coordinator builds a
[`FailureDumpReport`](architecture/monitor.md) on every snapshot,
periodic capture, and post-failure dump. Each captured map prints
as a `map <name> (type=..., value_size=..., max_entries=...)`
header followed by the rendered value (single-entry global
sections like `.bss`/`.data`) or `entry: key=...` blocks
(multi-entry maps). `u64` fields that the
[cast analyzer](architecture/monitor.md#cast-analysis) flagged as
typed pointers are chased to the recovered struct and printed with
a `(cast→arena)` or `(cast→kernel)` annotation distinguishing them
from BTF-typed pointers; an `(sdt_alloc)` suffix is added when the
sdt_alloc bridge recovered the real payload struct from a
forward-declared pointee. A separate cross-BTF Fwd resolution
path also recovers a forward-declared pointee whose body lives
in a sibling embedded BPF object's BTF; that path adds no
annotation — the body is rendered transparently:

```text
map scx_lavd.bss (type=array, value_size=4096, max_entries=1)
.bss:
  nr_cpus_onln=4
  task_ctx_root 0xffff888103a01000 (cast→arena) → task_ctx{cpu_id=2, last_runtime_ns=12345678, nice=0}
  current_task 0xffff90124f80c000 (cast→kernel) → task_struct:
    pid=4321   weight=100
    cpus_ptr 0xffff888103b40000 → cpus={0-3}
  taskc_data 0x7f0000080000 (cast→arena (sdt_alloc)) → task_data{slice_ns=20000000, vtime=12345678}
```

A field that the analyzer cannot prove is a pointer falls back to
its raw `u64` shape, which is the prior behavior — no
test-author configuration is required either way.

### Verifier output

`cargo ktstr verifier` runs the BPF verifier against every
`declare_scheduler!`-registered scheduler's struct_ops programs
inside a real kernel and prints per-program verified-instruction
counts. The dispatcher hands off to
`cargo nextest run -E 'test(/^verifier/)'`; nextest fans out
across (scheduler × declared kernel × accepted topology preset)
cells, each cell booting its own VM. Per-cell output starts with
a banner identifying the axis values:

```text
=== ktstr_sched | kernel kernel_6_14_2 | topology tiny-1llc ===

verifier
  enqueue                                  verified_insns=42

verifier --- verifier stats ---
  processed=42  states=8/10

verifier --- scheduler log ---
func#0 @0
0: R1=ctx() R10=fp0
processed 42 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 8 mark_read 5
```

When the scheduler did not capture a log, the output is just the
per-program table:

```text
=== ktstr_sched | kernel kernel_6_14_2 | topology tiny-1llc ===

verifier
  enqueue                                  verified_insns=500
  dispatch                                 verified_insns=1200
  init                                     verified_insns=300
```

`--raw` disables cycle collapsing in the scheduler-log section.
`--kernel A --kernel B` runs the sweep against multiple kernels;
the cell handler walks `KTSTR_KERNEL_LIST` to match each cell's
sanitized kernel label against the resolved set. For the full
verifier-sweep model, cycle-collapse rules, and the
cell-name → kernel matching contract, see
[Verifier](running-tests/verifier.md).

## What's next

- [Custom Scenarios](writing-tests/custom-scenarios.md) -- when the
  declarative ops API is not enough and the scenario needs arbitrary
  Rust logic between phases.
- [Ops and Steps](concepts/ops.md) -- multi-phase scenarios:
  add/remove cgroups, swap cpusets, freeze, resume.
- [Watch Snapshots](writing-tests/watch-snapshots.md) --
  `Op::watch_snapshot("symbol")` registers a hardware data-write
  watchpoint (up to 3 per scenario; slot 0 is reserved for the
  error-class exit_kind trigger).
- [MemPolicy](concepts/mem-policy.md) -- NUMA-aware tests that bind
  memory allocations to specific nodes and check page locality.
- [Performance Mode](concepts/performance-mode.md) -- pinned vCPUs,
  hugepages, and LLC-exclusivity validation for benchmark-grade runs.
- [Auto-Repro](running-tests/auto-repro.md) -- on a scheduler crash,
  ktstr can boot a second VM with probes attached and dump the
  failing state automatically.
- [Recipes](recipes.md) -- task-specific guides
  (test a new scheduler, A/B compare branches, customize checking,
  benchmarking, host-state diff, ctprof).