//! Scheduler-process lifecycle: reboot, sched-pid/monitor state, SIGCHLD handling, kill-with-grace.
//!
//! Split from rust_init.rs; the shared consts/statics/imports live in the
//! parent module (`super`), reached via the glob below.
use super::*;
/// Reboot immediately. Used for fatal init errors and normal shutdown.
pub !
/// Live identity of the currently-attached scheduler, parallel to
/// [`SCHED_PID`]'s pid side-channel. `null` means "no scheduler
/// attached" — the initial value at process start and the post-
/// `Op::DetachScheduler` state. Non-null points at a
/// `&'static SchedulerSpec` (the `binary` field of the
/// `&'static Scheduler` the Op carries), so consumers can read
/// `has_bpf_scheduler()` / `has_active_scheduling()` against the
/// LIVE identity rather than the boot-time `entry.scheduler`
/// descriptor that goes stale after `Op::ReplaceScheduler` swaps
/// the attached binary mid-scenario.
///
/// Storage: `AtomicPtr<SchedulerSpec>` because the value is a
/// reference to immutable static data (every `Scheduler` const
/// declared via `declare_scheduler!` lives in `.rodata` for the
/// lifetime of the process); the producer stores the `&'static
/// SchedulerSpec` re-cast to `*mut`, the consumer reads back as
/// `*const` and dereferences under the SAFETY argument that the
/// pointer either originated from a `&'static SchedulerSpec` (so
/// the `'static` lifetime is the entire process) or is `null`
/// (filtered by the wrapper). `AtomicPtr<T>` stores a `*mut T`, and
/// it is the only atomic type the standard library exposes for raw
/// pointer values. The `*mut` vs `*const` split is a Rust-level type
/// distinction, not a kernel-level mutability claim; the pointed-to
/// data is never mutated through this pointer.
///
/// `Acquire`/`Release` ordering pairs with [`SCHED_PID`]'s — the
/// two side channels co-publish a single logical scheduler-attach
/// event, and a reader that observes the new pid via
/// [`sched_pid`] also observes the new scheduler identity via
/// [`current_scheduler`].
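/**
A minimal sketch of the storage scheme described above, not this module's
actual code: a `&'static` reference is cast to `*mut` so it fits in an
`AtomicPtr`, and is read back as an `Option` with null mapped to `None`.
`Spec`, `DEMO_SPEC`, `CURRENT`, `set_current`, and `current` are hypothetical
stand-ins for `SchedulerSpec`, a `declare_scheduler!` const,
`CURRENT_SCHEDULER`, and the real accessors.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical stand-in for the real `SchedulerSpec`.
#[derive(Debug, PartialEq)]
pub struct Spec {
    pub name: &'static str,
}

// Immutable static data, so a reference to it is valid for the whole process.
pub static DEMO_SPEC: Spec = Spec { name: "demo" };

// Null means "nothing attached".
static CURRENT: AtomicPtr<Spec> = AtomicPtr::new(ptr::null_mut());

// Publish `spec`, or clear the slot when `None`.
pub fn set_current(spec: Option<&'static Spec>) {
    let raw = match spec {
        // `AtomicPtr` only stores `*mut`; the cast does not grant write access.
        Some(s) => s as *const Spec as *mut Spec,
        None => ptr::null_mut(),
    };
    CURRENT.store(raw, Ordering::Release);
}

// Read the slot back as a shared `'static` reference.
pub fn current() -> Option<&'static Spec> {
    let raw = CURRENT.load(Ordering::Acquire) as *const Spec;
    // SAFETY: `raw` is either null (mapped to `None` by `as_ref`) or was
    // derived from a `&'static Spec` in `set_current`, and nothing ever
    // writes through it.
    unsafe { raw.as_ref() }
}
```

The cast to `*mut` exists only to satisfy the `AtomicPtr` API. The reader
converts back to `*const` before dereferencing, so the sketch has no write
path to the pointee.
*/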
static CURRENT_SCHEDULER: AtomicPtr<SchedulerSpec> =
    AtomicPtr::new(std::ptr::null_mut());
/// Active [`SchedExitStop`] handle for the currently-running
/// scheduler's exit monitor. The boot path installs the initial
/// handle here via [`install_initial_sched_exit_monitor`]; the
/// scheduler-lifecycle Op dispatcher swaps it out via
/// [`stop_sched_exit_monitor`] + [`restart_sched_exit_monitor_with_log`]
/// so each post-Op scheduler PID gets its own monitor watching it.
///
/// Mutex (not Atomic) because [`SchedExitStop`] is move-only —
/// `stop_and_join` consumes it. `Option` because Op::DetachScheduler
/// leaves no scheduler attached, so the slot is empty between
/// detach and the next attach.
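/**
A minimal sketch of why the slot is a `Mutex<Option<_>>`, under stated
assumptions: `StopHandle` is a hypothetical stand-in for `SchedExitStop`, and
`install`, `stop`, and `is_empty` are illustrative names, not this module's
API. Because stopping consumes the handle, the slot has to give up ownership
via `Option::take`, which needs exclusive access.

```rust
use std::sync::Mutex;

// Hypothetical stand-in for `SchedExitStop`: stopping consumes the handle.
pub struct StopHandle {
    pub watched_pid: u32,
}

impl StopHandle {
    // Takes `self` by value, so the caller must own the handle to stop it.
    pub fn stop_and_join(self) -> u32 {
        self.watched_pid
    }
}

// An empty slot means no monitor is installed.
static SLOT: Mutex<Option<StopHandle>> = Mutex::new(None);

// Install a handle, returning whichever handle was in the slot before.
pub fn install(handle: StopHandle) -> Option<StopHandle> {
    SLOT.lock().unwrap().replace(handle)
}

// Move the handle out of the slot and stop it; a no-op on an empty slot.
pub fn stop() -> Option<u32> {
    // The lock is released at the end of this statement, before the
    // (potentially slow) stop runs.
    let taken = SLOT.lock().unwrap().take();
    taken.map(StopHandle::stop_and_join)
}

pub fn is_empty() -> bool {
    SLOT.lock().unwrap().is_none()
}
```

Calling `stop` twice is harmless in this sketch: the second call finds `None`
and returns without touching a handle.
*/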
static SCHED_EXIT_MONITOR_SLOT: Mutex<Option<SchedExitStop>> = Mutex::new(None);
/// Boot-captured context that
/// [`restart_sched_exit_monitor_with_log`] needs to re-supply when
/// it spawns a fresh monitor against the post-Op scheduler PID.
/// `suppress_com2` + `probe_output_done` are determined at boot
/// (based on whether the probe stack is active) and don't change
/// across Op dispatches — capturing once at install time keeps
/// the restart helper signature minimal.
static SCHED_EXIT_MONITOR_BOOT_CTX: = new;
/// Install the boot-time scheduler-exit monitor handle and capture
/// the dispatch context [`restart_sched_exit_monitor_with_log`]
/// needs to spawn replacement monitors. Called once at boot
/// after [`start_sched_exit_monitor`] returns.
///
/// `boot_stop` may be `None` when [`start_sched_exit_monitor`]
/// returned None (no scheduler configured at boot); the slot
/// stays empty and the first Op::AttachScheduler dispatch
/// populates it via [`restart_sched_exit_monitor_with_log`].
pub
/// Stop the currently-installed scheduler-exit monitor (if any).
/// The scheduler-lifecycle Op handler calls this BEFORE SIGTERM-ing
/// the scheduler so the monitor thread exits cleanly without
/// sending the `MSG_TYPE_SCHED_EXIT` message that the host's
/// freeze coordinator would otherwise promote into the run-wide
/// kill flag (per `src/vmm/freeze_coord/dispatch.rs` SchedExit
/// arm). Idempotent — a no-op when the slot is already empty.
pub
/// Returns true iff no scheduler-exit monitor is currently installed.
/// Used by the scenario-Op dispatch layer in `kill_current_scheduler`
/// to `debug_assert!` that `stop_sched_exit_monitor` properly cleared
/// the slot before the subsequent spawn restarts the monitor. The
/// `Op::AttachScheduler` path legitimately bypasses the kill helper
/// (no prior scheduler to stop) and the defensive `take()` in
/// [`restart_sched_exit_monitor_with_log`] handles that path's
/// possibly-non-empty entry — so the invariant is "after kill, slot
/// is empty," not "always empty before restart." Briefly locks the
/// slot mutex; release builds, where the assertion is a no-op, still
/// pay the lock cost, which is negligible next to the procfs writes,
/// signal delivery, and polling the dispatch site is already
/// doing.
pub
/// Spawn a fresh scheduler-exit monitor for the live SCHED_PID
/// and install it into the slot. Op handler calls this AFTER the
/// new scheduler is spawned and SCHED_PID is published, so the
/// monitor watches the post-Op PID. `log_path` is the per-spawn
/// log file path — all three lifecycle Ops (Attach, Replace,
/// Restart) pass the seq-suffixed path from
/// `staged_scheduler_log_path`.
///
/// Uses the boot-captured `suppress_com2` + `probe_output_done`
/// so the new monitor behaves identically to the boot monitor. If
/// the boot ctx was never installed (degenerate test environment
/// where `install_initial_sched_exit_monitor` never ran) the
/// helper is a no-op and the new scheduler stays unmonitored —
/// the boot path is the only legitimate context that installs
/// the ctx.
pub
/// Read the scheduler PID published by [`start_scheduler`]. Returns
/// `None` when the scheduler has not been spawned yet (the atomic
/// reads as `0`, the sentinel for "unset"). `Acquire` synchronises
/// against the producer's `Release` store so any side effects
/// `start_scheduler` performed before the publish are visible to the
/// reader.
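/**
A minimal sketch of the zero-sentinel convention described above. It assumes
the pid is held in an `AtomicU32`; `PID`, `publish_pid`, and `read_pid` are
hypothetical names, and the real `SCHED_PID` may use a different integer
width.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// 0 is never a valid child pid, so it doubles as "unset".
static PID: AtomicU32 = AtomicU32::new(0);

// Writer side: `Release` makes earlier writes visible to a reader that
// later observes this pid.
pub fn publish_pid(pid: u32) {
    PID.store(pid, Ordering::Release);
}

// Reader side: `Acquire` pairs with the `Release` store above.
pub fn read_pid() -> Option<u32> {
    match PID.load(Ordering::Acquire) {
        0 => None,
        pid => Some(pid),
    }
}
```

Mapping the sentinel to `None` at the accessor keeps the raw `0` from leaking
into call sites that might otherwise pass it on as a pid.
*/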
pub
/// Publish `pid` to the [`SCHED_PID`] side channel. Used by the
/// scheduler-lifecycle Op dispatch on the guest to swap the live PID
/// across Detach (`pid = 0`) / Attach (`pid = new child`) /
/// Replace (`pid = swap`) transitions. The boot path
/// ([`spawn_scheduler_from_paths`]) calls this directly with the
/// freshly-spawned `child.id()`.
///
/// `Release` ordering pairs with the `Acquire` load in
/// [`sched_pid`]; the writer's side effects (Op log emit, prior
/// kill) are visible to the next reader.
pub
/// Read the live scheduler identity published by the dispatch
/// arms of `Op::AttachScheduler` / `Op::ReplaceScheduler` (the
/// matching `set_current_scheduler` call site lives in
/// `src/scenario/ops/mod.rs`). Returns `None` when no scheduler
/// is currently attached — the pre-attach state at process start
/// and the post-`Op::DetachScheduler` state.
///
/// `Acquire` ordering synchronises against the producer's
/// `Release` store so any side effects the dispatch path
/// performed before the publish are visible to the reader.
///
/// The returned reference inherits the `'static` lifetime of the
/// stored `&'static SchedulerSpec` — every `Scheduler` declared
/// via `declare_scheduler!` lives in `.rodata` for the process
/// lifetime, and the producer always stores a reference into that
/// region.
/// Publish `scheduler` as the currently-attached scheduler, or
/// clear the slot when `None`. Called by the
/// `Op::AttachScheduler` / `Op::ReplaceScheduler` /
/// `Op::DetachScheduler` dispatch arms in
/// `src/scenario/ops/mod.rs` immediately after the corresponding
/// pid change so the two side channels (pid + identity) stay
/// co-published.
///
/// `Release` ordering pairs with the `Acquire` load in
/// [`current_scheduler`].
pub
/// RAII guard that flips SIGCHLD to a target disposition on
/// construction and restores the previous handler on drop. Used by
/// [`with_sigchld_default`] so a panic inside the closure cannot
/// leak `SIG_DFL` into the rest of the guest's lifetime — Drop
/// runs even on unwind.
///
/// `libc::signal` returns the previous handler on every call, so
/// the snapshot we capture in `install` is the authoritative value
/// to restore in `Drop`. Re-installing the snapshot makes the
/// guard idempotent across nested calls (an outer guard's restore
/// observes the inner guard's restore as a no-op rebind to the
/// same handler).
/// Run `f` with SIGCHLD temporarily restored to `SIG_DFL` so the
/// kernel does not auto-reap any child spawned inside the closure.
/// `Command::status()` calls `waitpid(2)`, which returns `ECHILD`
/// when SIGCHLD is `SIG_IGN` (the default installed by
/// [`ktstr_guest_init`] for zombie prevention) — losing the real
/// exit status. Restoring `SIG_DFL` for the closure's lifetime
/// re-enables `waitpid` reaping; the post-closure restore puts
/// the previous disposition back so subsequent guest children
/// continue to be auto-reaped without leaking zombies.
///
/// Mirrors the inline save/restore pattern formerly open-coded at
/// the [`ktstr_guest_init`] shell `--exec` site (now also routed
/// through this helper). Both call sites share the same
/// SIGCHLD-vs-`waitpid` hazard; centralising the helper prevents
/// drift between the two implementations.
///
/// Restore is panic-safe via [`SigchldDispositionGuard`]: a panic
/// in `f` runs the guard's `Drop`, which re-installs the previous
/// SIGCHLD handler before unwinding past the helper boundary.
/// Without the guard, a panicking child-spawn site would leak
/// `SIG_DFL` into the rest of the guest, breaking PID 1's zombie
/// reaping for every subsequent fork.
///
/// The closure must reap every child it spawns before returning.
/// Leaving an unreaped child at the boundary where `SIG_IGN` is
/// restored would orphan the zombie until the next reaper cycle.
/// `Command::status()` waits synchronously, so the typical caller
/// satisfies this invariant by construction.
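/**
A minimal model of the save/restore guard, for illustration only. It does NOT
touch real signal state: an `AtomicUsize` stands in for the process-wide
SIGCHLD disposition, and `swap` plays the role of `libc::signal` returning the
previous handler. `DISPOSITION`, `IGNORE`, `DEFAULT`, `DispositionGuard`, and
`with_default` are all hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-ins for the two dispositions the text calls `SIG_DFL` and `SIG_IGN`.
pub const DEFAULT: usize = 0;
pub const IGNORE: usize = 1;

// Models the process-wide SIGCHLD disposition, starting out ignored.
pub static DISPOSITION: AtomicUsize = AtomicUsize::new(IGNORE);

// Swaps in a target disposition and remembers the one it displaced.
pub struct DispositionGuard {
    previous: usize,
}

impl DispositionGuard {
    pub fn install(target: usize) -> Self {
        // `swap` hands back the old value, which is the snapshot to restore.
        let previous = DISPOSITION.swap(target, Ordering::SeqCst);
        DispositionGuard { previous }
    }
}

impl Drop for DispositionGuard {
    fn drop(&mut self) {
        // Runs on normal return and while unwinding from a panic.
        DISPOSITION.store(self.previous, Ordering::SeqCst);
    }
}

// Run `f` under `DEFAULT`, restoring the prior disposition afterwards.
pub fn with_default<T>(f: impl FnOnce() -> T) -> T {
    let _guard = DispositionGuard::install(DEFAULT);
    f()
}
```

Binding the guard to `_guard` rather than `_` matters here: a bare `_` pattern
would drop the guard immediately and restore the disposition before `f` runs.
*/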
pub
/// Whether `/proc/{pid}` exists. Used as a `waitpid`-free liveness
/// probe: under SIGCHLD `SIG_IGN` the kernel auto-reaps children, so
/// `waitpid` returns `ECHILD` even when the child exited cleanly.
/// `/proc/{pid}` removal is signal-disposition-independent — the
/// directory disappears the moment the kernel finishes
/// `release_task` for the pid (see kernel/exit.c
/// `release_task` → `proc_flush_pid`), regardless of whether
/// `waitpid` ever ran.
///
/// Returns `true` when `/proc/{pid}` exists (process alive or
/// pre-reap), `false` when it does not (process exited and the
/// kernel has dropped the procfs entry).
/// SIGCHLD = SIG_IGN-safe liveness probe via procfs. The guest init
/// installs `SIGCHLD = SIG_IGN` process-wide (see
/// [`with_sigchld_default`] doc) so the kernel auto-reaps children
/// without explicit `waitpid`. Under that disposition `waitpid`
/// returns `ECHILD` even on a clean exit, so a `Command::status` /
/// `Child::wait` is the wrong tool for "is this pid still running".
///
/// `/proc/{pid}` removal is signal-disposition-independent: the
/// directory disappears the moment the kernel finishes `release_task`
/// for the pid (see kernel/exit.c `release_task` →
/// `proc_flush_pid`), regardless of how SIGCHLD is handled. Polling
/// `/proc/{pid}` therefore observes the real exit on every code path
/// where SIGCHLD might be ignored. Returns `true` when `/proc/{pid}`
/// exists (process alive or pre-reap), `false` when it does not
/// (process exited and the kernel has dropped the procfs entry).
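/**
A minimal sketch of a procfs liveness probe of the kind described above. It
assumes Linux with procfs mounted at `/proc` and a `u32` pid;
`procfs_entry_exists` is a hypothetical name, and the real helper's signature
may differ.

```rust
use std::path::Path;

// True while `/proc/<pid>` exists. No `waitpid` call is involved, so the
// answer does not depend on how SIGCHLD is handled.
pub fn procfs_entry_exists(pid: u32) -> bool {
    Path::new("/proc").join(pid.to_string()).exists()
}
```

Like any pid-based probe, this can be fooled by pid reuse if the caller holds
on to a pid long after the process has exited.
*/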
pub
/// Outcome reported by a successful [`kill_scheduler_process`] call.
/// Three variants because the operator-visible signal (caller-side
/// logging, sidecar event) differs by how the child responded:
/// already-gone callers know there was nothing to do; sigterm-graceful
/// exit is the scx-convention happy path; sigkill-escalation is the
/// notable case (the scheduler binary either ignored SIGTERM or its
/// userspace signal handler ran too slow against the grace window).
//
// `#[allow(dead_code)]` because the helper has no production caller
// in this commit — the Op::DetachScheduler / Op::RestartScheduler /
// Op::ReplaceScheduler dispatchers that will consume it land in
// follow-up work. Tests in this module exercise every variant + the
// InvalidPid error path, so the helper is verified-correct as it
// lands; the allow becomes a no-op the moment the first production
// caller wires up.
pub
/// Failure modes for [`kill_scheduler_process`]. Both indicate the
/// caller-supplied invariant (a kill-able pid) was violated or the
/// kernel refused to honor a SIGKILL — neither is recoverable at the
/// call site, but both carry distinct operator diagnostics.
pub
/// Send SIGTERM to `pid`, wait up to `sigterm_grace` for the process
/// to exit (observed via `/proc/{pid}` removal), then escalate to
/// SIGKILL if the polite shutdown did not land. Returns the variant
/// that describes how the kill resolved.
///
/// # Why procfs polling instead of `waitpid`
///
/// The guest init installs SIGCHLD = SIG_IGN globally so PID 1 does
/// not have to reap every zombie (see [`with_sigchld_default`] and
/// the doc on [`proc_pid_alive`]). Under that disposition the kernel
/// auto-reaps children before `waitpid` runs, so `waitpid` returns
/// `ECHILD` even on a clean exit. `/proc/{pid}` removal is
/// signal-disposition-independent: the directory disappears the
/// moment the kernel runs `release_task` for the pid, regardless of
/// how SIGCHLD is handled. Polling `/proc/{pid}` therefore observes
/// the real exit on every code path where SIGCHLD might be ignored.
///
/// # Why SIGTERM first, SIGKILL fallback
///
/// scx schedulers (per the upstream
/// `tools/sched_ext/scx_simple.c:71-72` convention) install one
/// shared signal handler for SIGINT + SIGTERM: setting an exit-
/// request flag that the scheduler's main loop polls, then dropping
/// the BPF skeleton which triggers the kernel's `scx_disable_workfn`
/// path. SIGTERM is the safe shutdown signal — every well-behaved
/// scx scheduler honors it. SIGKILL bypasses the userspace handler
/// (final-log-flush, graceful destructor) but the kernel still
/// observes the BPF program refcount drop and runs the disable path,
/// so the kernel-side scheduler state cleans up regardless. SIGKILL
/// after a bounded SIGTERM grace is the strict-correctness fallback
/// for a scheduler binary that has no SIGTERM handler installed or
/// took longer than `sigterm_grace` to exit.
///
/// # Pid lifecycle semantics
///
/// This function does NOT mutate [`SCHED_PID`]. The
/// scheduler-lifecycle dispatcher owns that side channel and is
/// responsible for storing 0 after a successful detach so subsequent
/// liveness checks (`sched_pid()` readers) short-circuit. Keeping
/// the kill helper generic (no implicit singleton-pid assumption)
/// lets unit tests exercise it against any spawned child pid.
///
/// # Poll cadence
///
/// 50ms polling interval, matching the existing
/// [`poll_startup`] cadence so the latency-vs-CPU tradeoff is
/// consistent across the scheduler-lifecycle helpers. The
/// post-SIGKILL grace is the module-level [`POST_SIGKILL_GRACE`]
/// const (see that const's doc for the rationale behind its
/// value).
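/**
A minimal sketch of the SIGTERM-then-SIGKILL escalation ladder, with signal
delivery and the exit wait injected as closures so the control flow can be
exercised without real processes. Everything named here (`Signal`, `Outcome`,
`StillAlive`, `kill_with_grace`, and the closure parameters) is hypothetical;
the real function takes a pid and has its own outcome and error types.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Signal {
    Term,
    Kill,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    AlreadyGone,
    ExitedOnTerm,
    KilledAfterGrace,
}

// The target survived SIGKILL plus its grace window.
#[derive(Debug, PartialEq, Eq)]
pub struct StillAlive;

// `send` delivers a signal; `gone_within(budget)` waits up to `budget` and
// reports whether the target disappeared in that time.
pub fn kill_with_grace(
    mut send: impl FnMut(Signal),
    mut gone_within: impl FnMut(Duration) -> bool,
    term_grace: Duration,
    kill_grace: Duration,
) -> Result<Outcome, StillAlive> {
    // A zero budget is a single liveness check with no waiting.
    if gone_within(Duration::ZERO) {
        return Ok(Outcome::AlreadyGone);
    }
    // Polite request first, so a userspace handler can shut down cleanly.
    send(Signal::Term);
    if gone_within(term_grace) {
        return Ok(Outcome::ExitedOnTerm);
    }
    // Escalate only after the bounded grace has expired.
    send(Signal::Kill);
    if gone_within(kill_grace) {
        return Ok(Outcome::KilledAfterGrace);
    }
    Err(StillAlive)
}
```

Injecting the two closures is a choice made for this sketch so that each rung
of the ladder can be tested deterministically; it says nothing about how the
real helper is parameterised.
*/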
// production callers (Op::*Scheduler dispatch) wire up in follow-up work
pub
/// Post-SIGKILL grace inside [`kill_scheduler_process`]. SIGKILL
/// triggers the kernel's `exit_notify` → `release_task` cascade
/// (kernel/exit.c) which removes `/proc/{pid}`; the wait here covers
/// both the routine reap path (sub-100ms for a simple userspace
/// process) AND the scheduler-lifecycle Op kill path where an scx
/// scheduler's exit blocks on `scx_disable_workfn`
/// (`kernel/sched/ext.c:5923`) tearing down BPF programs from a
/// workqueue. BPF tear-down dominates the SIGKILL→/proc removal
/// latency for scx_* binaries and routinely exceeds 1s on
/// loaded kernels; 2s leaves comfortable headroom while keeping
/// the unit test fast for the simple-process case (the test
/// closure exits immediately on SIGKILL so the post-SIGKILL poll
/// returns in <50ms).
///
/// A `StillAliveAfterSigkill` firing AFTER this budget indicates a
/// structurally wrong target — D-state hang, kernel UB, BPF cleanup
/// deadlock — and operators should treat the variant as a debug
/// signal, not a transient retry case. Carried as a module-level
/// const so the value is greppable + paired with a single doc
/// explaining the choice rather than left as a magic number at the
/// call site.
const POST_SIGKILL_GRACE: Duration = Duration::from_secs(2);
/// Poll `/proc/{pid}` for absence up to `timeout`, sleeping at the
/// caller's `interval` cadence between checks. Returns `true` if the
/// pid's procfs entry disappears within the budget, `false`
/// otherwise.
///
/// Single source of truth for "wait until the kernel runs
/// release_task for this pid": [`kill_scheduler_process`] uses it to
/// observe SIGTERM / SIGKILL aftermath, and [`poll_startup`]'s
/// pidfd-unavailable fallback uses it to observe early-death during
/// scheduler launch. Both call sites need the same SIG_IGN-safe
/// latency profile, so folding the loop here keeps a future EINTR
/// or signal-pause refinement applied uniformly.
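/**
A minimal sketch of a deadline-bounded poll loop of the kind described above.
The liveness check is passed in as a closure so the loop is independent of
procfs; `wait_until_gone` and its parameters are hypothetical names, and the
real helper takes a pid rather than a closure.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Poll `alive` until it reports false or `timeout` elapses. Returns `true`
// if the target went away within the budget, `false` otherwise.
pub fn wait_until_gone(
    mut alive: impl FnMut() -> bool,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        // Check before sleeping so an already-gone target returns at once.
        if !alive() {
            return true;
        }
        let now = Instant::now();
        if now >= deadline {
            return false;
        }
        // Never sleep past the deadline.
        thread::sleep(interval.min(deadline - now));
    }
}
```

Clamping the sleep to the remaining budget keeps the worst-case overshoot to
roughly one liveness check rather than a full `interval`.
*/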
pub