//! The SANCTIONED unsafe basement for the single-threaded Linux confinement
//! LAUNCHER (kernel plan §10.8). The ONE quarantine where the launcher's
//! raw-syscall `unsafe` is permitted to live; every `unsafe` block here carries a
//! `LEDGER:<id>` anchor reconciled against `traceability/unsafe_ledger.yaml` by the
//! `structural-check` unsafe-ledger gate (fail-closed). The safe orchestration in
//! `main.rs` (sequencing, the transcript, the decision logic) NEVER contains
//! `unsafe` — it calls down through the narrow wrappers below.
//!
//! ## The async-signal-safety contract (the load-bearing invariant)
//! The launcher creates the workload child via raw `clone3` ([`clone3_child`]) —
//! NOT `std::process::Command` (which would `fork`+`exec` behind a `.spawn()` the
//! single-thread gate bans, and is not under our control). After `clone3` the CHILD
//! branch runs in a window where, post-fork in a (formerly) multi-thread-capable
//! address space, ONLY async-signal-safe syscalls are legal: no heap allocation, no
//! lock, no Rust std that allocates. This basement upholds that by BUILDING EVERY
//! pointer / array / fd the child needs IN THE PARENT, before `clone3`, packed into
//! a [`ChildExecPlan`]; the child branch then only INDEXES that already-allocated
//! memory (copy-on-write after fork — reading touches no allocator) and issues the
//! listed async-signal-safe syscalls (`close`, `sigprocmask`, `fchdir`, `fexecve`,
//! `write`, `_exit`). If a step here cannot honestly be made allocation-free it does
//! NOT belong in the child window.
use ;
// The compiled BPF the coordinator builds in the PARENT (via the bvisor seccomp model)
// and the child installs LAST in its window. `sock_filter` is the kernel-ABI BPF
// instruction; `BpfProgram = Vec<sock_filter>` is the assembled stream. The child only
// READS this pre-built slice (no allocation) when building the stack `sock_fprog`.
pub use ;
// Re-export so the SAFE coordinator (`imp.rs`) can name the owned ruleset type it
// carries from `build_landlock_ruleset` into `clone3_child` without itself depending
// on the `landlock` crate surface.
pub use RulesetCreated;
use CString;
use File;
use ;
use ;
/// `LANDLOCK_CREATE_RULESET_VERSION` (uapi `linux/landlock.h`): asks
/// `landlock_create_ruleset` for the supported ABI version instead of creating a
/// ruleset. Stable kernel ABI constant.
const LANDLOCK_CREATE_RULESET_VERSION: c_uint = 1;
/// The landlock ABI floor the launcher confines at. `ABI::V3` is the access set the
/// parent-side ruleset is built from; the launcher refuses to advertise confinement
/// when the live kernel ABI is below this floor (see [`build_landlock_ruleset`]).
const LANDLOCK_ABI_FLOOR: ABI = ABI::V3;
/// The same floor as the raw kernel ABI integer the live probe returns, so the SAFE
/// coordinator can compare [`probe_landlock_abi`]'s result without depending on the
/// `landlock` crate's `ABI` enum. Kept in lockstep with [`LANDLOCK_ABI_FLOOR`].
pub const LANDLOCK_ABI_FLOOR_RAW: i64 = ABI::V3 as i64;
/// `CLONE_INTO_CGROUP` (uapi `linux/sched.h`, kernel ≥ 5.7): a `clone3` flag asking
/// the kernel to place the new child DIRECTLY into the cgroup whose fd is in
/// `clone_args.cgroup`, at birth — eliminating the post-fork
/// write-pid-to-`cgroup.procs` migration race. Named here as an explicit `u64`
/// because the value `0x2_0000_0000` is 2^33 (wider than `i32`), while libc types the
/// gnu-linux constant as `c_int`; `clone_args.flags`/`.cgroup` are both `c_ulonglong`.
const CLONE_INTO_CGROUP: u64 = 0x2_0000_0000;
/// `CLONE_NEWUSER` (uapi `linux/sched.h`): a `clone3`/`clone` flag asking the kernel to
/// create the child in a NEW user namespace. Named here as an explicit `u64` (its value
/// `0x1000_0000` fits `i32`, but `clone_args.flags` is `c_ulonglong`, so we keep it wide
/// to OR it into `flags` without a lossy cast). Set ONLY when the plan opts into the
/// userns rendezvous (S8) — the child is born unmapped (overflow uid) and BLOCKS until
/// the parent writes its uid/gid maps and releases it (then it is uid 0 in the userns).
const CLONE_NEWUSER: u64 = 0x1000_0000;
/// `CLONE_NEWNET` (uapi `linux/sched.h`): a `clone3`/`clone` flag asking the kernel to
/// create the child in a NEW, EMPTY network namespace (proof-spine S9 / D3 — the
/// `NetworkDenyAll` mechanism). Named here as an explicit `u64` (its value `0x4000_0000`
/// fits `i32`, but `clone_args.flags` is `c_ulonglong`, so we keep it wide to OR it into
/// `flags` without a lossy cast). Set ONLY when the plan opts into the empty netns — and
/// ONLY ALONGSIDE `CLONE_NEWUSER` (an unprivileged process may create a new netns only when
/// it is also root in a new userns; the caller enforces the pairing). The child is born into
/// an empty netns (only `lo`, with no address + no routes => unreachable, no external interface)
/// so it is structurally unable to reach any network. This is just a FLAG BIT — it adds NO new syscall.
const CLONE_NEWNET: u64 = 0x4000_0000;
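// A minimal sketch of how the flag bits above could be OR-ed into the `clone_args.flags`
// word. `compose_clone_flags` and its boolean parameters are hypothetical names used only
// for illustration, not the launcher's API. The constants repeat the values declared above
// so the block compiles on its own, and the rejected combination is the pairing rule the
// `CLONE_NEWNET` comment assigns to the caller.

```rust
// Repeated from the declarations above so the sketch is self-contained.
const CLONE_INTO_CGROUP: u64 = 0x2_0000_0000;
const CLONE_NEWUSER: u64 = 0x1000_0000;
const CLONE_NEWNET: u64 = 0x4000_0000;

/// Hypothetical helper: OR the opted-in flag bits into one `u64`.
/// Returns `None` for the forbidden combination (new netns without a new userns).
pub fn compose_clone_flags(into_cgroup: bool, new_userns: bool, new_netns: bool) -> Option<u64> {
    if new_netns && !new_userns {
        return None;
    }
    let mut flags = 0u64;
    if into_cgroup {
        flags |= CLONE_INTO_CGROUP;
    }
    if new_userns {
        flags |= CLONE_NEWUSER;
    }
    if new_netns {
        flags |= CLONE_NEWNET;
    }
    Some(flags)
}
```

// Because every constant is already `u64`, the OR needs no cast: opting into all three
// bits yields `0x2_5000_0000`, a value that would not survive a round trip through `i32`.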
/// One declared confinement root the launcher restricts FS access TO: a pre-opened,
/// fstat-validated descriptor (NEVER a path — exec/landlock rides the inherited fd,
/// avoiding the CVE-2019-5736 reopen race) and whether the workload may write beneath
/// it. Read+execute is ALWAYS granted under a root; `writable` additionally grants the
/// write/create access set. Inert plain data the SAFE coordinator fills in.
pub
/// The `fstat`-observed shape of a descriptor: its kind (from `st_mode & S_IFMT`)
/// and whether it was opened writable (from the file-status `O_ACCMODE` flags).
/// Inert plain data — the safe orchestration compares it to the declared
/// `DescriptorShape` without ever touching `unsafe`.
pub
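// A minimal sketch of the classification described above: the kind comes from
// `st_mode & S_IFMT`, the writable bit from the file-status flags masked with `O_ACCMODE`.
// `Kind`, `Shape` and `classify` are hypothetical stand-ins (the real type's fields are
// not shown here), and the numeric constants are the Linux uapi values, assumed for this
// sketch rather than taken from the `libc` crate.

```rust
// Linux uapi values, repeated here so the sketch needs no external crate.
const S_IFMT: u32 = 0o170000;
const S_IFREG: u32 = 0o100000;
const S_IFDIR: u32 = 0o040000;
const O_ACCMODE: i32 = 0o3;
const O_WRONLY: i32 = 0o1;
const O_RDWR: i32 = 0o2;

/// Hypothetical descriptor kind, reduced to the cases this sketch distinguishes.
#[derive(Debug, PartialEq, Eq)]
pub enum Kind {
    Regular,
    Directory,
    Other,
}

/// Hypothetical observed shape: kind plus "was it opened writable".
#[derive(Debug, PartialEq, Eq)]
pub struct Shape {
    pub kind: Kind,
    pub writable: bool,
}

/// Classify the `st_mode` from `fstat` and the status flags from `fcntl(F_GETFL)`.
pub fn classify(st_mode: u32, status_flags: i32) -> Shape {
    let kind = match st_mode & S_IFMT {
        S_IFREG => Kind::Regular,
        S_IFDIR => Kind::Directory,
        _ => Kind::Other,
    };
    // Only the access-mode bits matter; flags such as O_CLOEXEC are masked away.
    let access = status_flags & O_ACCMODE;
    Shape {
        kind,
        writable: access == O_WRONLY || access == O_RDWR,
    }
}
```

// Keeping the classification a pure function of two integers lets the safe side compare
// shapes without touching `unsafe`; only the `fstat`/`fcntl` calls themselves need it.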
/// A fully pre-built child-execution plan: EVERYTHING the post-`clone3` child needs,
/// allocated in the single-threaded parent BEFORE the fork. The child branch only
/// reads these fields; it never allocates, locks, or grows any of them.
///
/// `argv`/`envp` are NUL-terminated arrays of pointers into the `CString`s held in
/// `_argv_storage`/`_envp_storage` (kept alive for the plan's lifetime so the
/// pointers stay valid). `close_fds` is the scrub close-list. `error_fd` is the
/// `O_CLOEXEC` write end of the error pipe — successful `fexecve` auto-closes it, so
/// the coordinator observes EOF; any failure writes the errno here before `_exit`.
pub
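// A minimal sketch of the storage-plus-pointer-array layout described above, reduced to
// the `argv` half. `ArgvPlan` and `build_argv` are hypothetical names for illustration;
// the real plan carries more fields. Everything here allocates, so under the contract
// above it would run in the parent, before `clone3`.

```rust
use std::ffi::{CString, NulError};
use std::os::raw::c_char;
use std::ptr;

/// Hypothetical, reduced stand-in for the argv half of a child-execution plan.
pub struct ArgvPlan {
    /// NUL-terminated array of pointers, in the shape an exec call expects.
    pub argv: Vec<*const c_char>,
    /// Owns the bytes `argv` points into; must live as long as `argv` is used.
    _argv_storage: Vec<CString>,
}

/// Build the owned strings first, then the pointer array over them.
pub fn build_argv(args: &[&str]) -> Result<ArgvPlan, NulError> {
    // An interior NUL byte makes `CString::new` fail, which surfaces as a build error.
    let storage = args
        .iter()
        .map(|a| CString::new(*a))
        .collect::<Result<Vec<_>, _>>()?;
    // Each CString owns its own heap buffer, so moving `storage` into the struct
    // afterwards does not invalidate these pointers.
    let mut argv: Vec<*const c_char> = storage.iter().map(|s| s.as_ptr()).collect();
    argv.push(ptr::null());
    Ok(ArgvPlan {
        argv,
        _argv_storage: storage,
    })
}
```

// After the fork the child would only read `argv.as_ptr()`; nothing in this layout needs
// to grow or be freed on the child side.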
/// Why a [`ChildExecPlan`] could not be built (all in the PARENT, before any fork —
/// allocation here is fine and these are ordinary fallible-build errors).
pub
/// The OPT-IN user-namespace rendezvous fds the child window needs (S8). Bundled so the
/// plan builder carries one optional value instead of two correlated fds: `read` is the
/// sync-pipe READ end the child blocks on; `write` is the child's inherited copy of the
/// WRITE end, which the child closes FIRST (so the parent's fail-closed close yields a
/// clean EOF rather than a deadlock). `None` ⇒ no rendezvous (no-userns path unchanged).
pub
/// Probe the LIVE landlock ABI integer straight from the kernel.
///
/// Returns the supported ABI version (`>= 1`), or `0` when landlock is unavailable
/// (old kernel / disabled LSM). The COORDINATOR floors the confinement at
/// [`LANDLOCK_ABI_FLOOR`]: a probe below that ⇒ the launcher refuses the landlock
/// action (`SetupRefused{MissingPrimitive}`) rather than advertising a confinement it
/// cannot deliver. Pure observation, run in the single-threaded parent before clone3.
pub
/// Build the landlock ruleset restricting FS access to exactly `roots`, IN THE PARENT
/// (before clone3) — async-signal-safety: ALL heap allocation, the
/// `landlock_create_ruleset`/`landlock_add_rule` syscalls, and the rule construction
/// happen HERE; the post-clone3 child only calls `restrict_self` (allocation-free).
///
/// Each rule is built from a [`BorrowedFd`] of the INHERITED root fd — NOT by
/// reopening a path (the CVE-2019-5736 / Leaky-Vessels reopen race the protocol
/// forbids, and strictly better than the backend's `PathFd::new(path)`). Read-only
/// roots get the read access set; writable roots get read+write. Built at
/// [`CompatLevel::HardRequirement`] so a kernel that cannot honor the ruleset fails
/// CLOSED (the caller has already probed the ABI floor, so the requirement is met).
///
/// The `roots` slice is the coordinator-resolved, already-`fstat`-validated root
/// descriptors. Building the ruleset does NOT confine the parent: only `restrict_self`
/// (in the child) applies it. SAFE: the `landlock` crate is pure safe Rust.
///
/// # Errors
/// An `io::Error` if the ruleset cannot be created (e.g. the ABI floor is not met at
/// `HardRequirement`, or a root fd cannot be borrowed) — fail closed, never widen.
pub
/// Render a landlock error as an `io::Error` (coordinator-side, pre-clone3 — the
/// allocation in the message is fine here, never in the child window).
/// Read ALL bytes from an inherited raw fd into an owned `Vec` — used by the
/// COORDINATOR (single-threaded, pre-`clone3`), where heap allocation is fine.
///
/// The fd is adopted into a temporary [`File`], drained, then released WITHOUT
/// closing (`into_raw_fd`) so the caller still owns the underlying descriptor and
/// the launcher fd-accounting stays exact.
///
/// # Errors
/// Any `io::Error` from the read.
pub
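// A minimal sketch of the adopt, drain, release pattern described above, using only `std`.
// `read_all_keep_open` is a hypothetical name; the sketch assumes the caller passes an
// open, readable descriptor that it owns. The result of the read is captured before the
// fd is released so that an early return can never drop the `File` and close the fd.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::os::unix::io::{FromRawFd, IntoRawFd, RawFd};

/// Hypothetical helper: read everything from `fd` without taking ownership of it.
pub fn read_all_keep_open(fd: RawFd) -> io::Result<Vec<u8>> {
    // SAFETY (assumed by this sketch): `fd` is an open descriptor owned by the caller.
    let mut file = unsafe { File::from_raw_fd(fd) };
    let mut buf = Vec::new();
    let result = file.read_to_end(&mut buf);
    // Give the descriptor back before propagating any error: `into_raw_fd` consumes the
    // `File` without closing, so the caller's fd accounting is unchanged either way.
    let _ = file.into_raw_fd();
    result.map(|_| buf)
}
```

// The caller still owns `fd` afterwards and remains responsible for closing it.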
/// `fstat` an inherited descriptor and return its observed shape (kind + writable),
/// for the COORDINATOR's handle-verification step. Pure observation — no fd is
/// created, consumed, or mutated.
///
/// # Errors
/// An `io::Error` carrying the `fstat`/`fcntl` errno on failure.
pub
/// Adopt an inherited raw fd as an owned [`File`] for the COORDINATOR to write its
/// transcript (control fd) or read the child's error report (error-pipe read end).
/// The returned `File` OWNS the descriptor and closes it on drop — the caller must
/// pass an fd it intends the launcher to own for the rest of the run.
pub
/// Create the workload child via raw `clone3` and, IN THE CHILD, run the
/// deterministic async-signal-safe `scrub → (optional fchdir) → fexecve` sequence on
/// the PRE-BUILT [`ChildExecPlan`]. Returns the child pid to the PARENT.
///
/// Topology (PERMANENT): coordinator (this process) → workload child → exec target.
/// The launcher NEVER self-execs: `clone3` makes a real child and the parent
/// returns. On success the child's image is replaced by the target; on any child
/// failure the child writes the errno to the error pipe and `_exit(127)`s, and the
/// parent observes the failure via the error pipe + `waitid`.
///
/// `confinement` is the OPTIONAL parent-built landlock ruleset (`None` ⇒ no landlock
/// action scheduled). It is built (all allocation + add_rule syscalls) BEFORE this
/// call by [`build_landlock_ruleset`]; the child applies it via `restrict_self` after
/// the fd scrub and before `fexecve`. The parent branch never touches it (it drops at
/// return, closing only the parent's copy of the ruleset fd — the child holds its own
/// post-clone3 copy, so the parent drop does not affect the child's confinement).
///
/// `cgroup_fd` is the OPTIONAL inherited
/// [`DescriptorRole::CgroupDir`](bvisor::linux::protocol::DescriptorRole::CgroupDir)
/// directory fd
/// (`None` ⇒ no cgroup placement). When `Some`, `clone3` is asked (via
/// `CLONE_INTO_CGROUP`) to place the child DIRECTLY into that prepared leaf at birth,
/// so the workload is resource-confined the instant it exists — no post-fork migration
/// window. The kernel consumes the fd DURING the syscall in the parent; the child never
/// touches it (so the scrub may close its inherited copy harmlessly).
///
/// # Errors
/// An `io::Error` carrying the `clone3` errno if the fork itself fails (the child
/// never exists, so nothing ran) — including an invalid/forbidden cgroup fd, which
/// fails the syscall rather than running the child uncgrouped.
pub
/// The CHILD branch body: the deterministic async-signal-safe sequence. Diverges —
/// it either `fexecve`s (image replaced) or `_exit`s. NEVER returns into Rust, so no
/// destructor runs and no unwinding crosses the fork. Marked `unsafe` because it
/// dereferences the pre-built raw pointer arrays and issues raw syscalls.
///
/// SAFETY: callable ONLY from the `rc == 0` child branch of [`clone3_child`], with a
/// `plan` whose `argv`/`envp`/`close_fds`/`sync_read_fd` were fully built in the parent,
/// an OPTIONAL `confinement` ruleset whose every allocation + `add_rule` syscall ran in
/// the parent, and an OPTIONAL `seccomp` BPF program assembled ENTIRELY in the parent. It
/// indexes only that already-allocated memory and calls only async-signal-safe syscalls —
/// the optional userns-rendezvous blocking `read` (raw `SYS_read` into a stack byte),
/// `restrict_self` (`prctl` + `landlock_restrict_self`), the STANDALONE
/// `prctl(PR_SET_NO_NEW_PRIVS)`, and `seccomp(SECCOMP_SET_MODE_FILTER, ..)` on the pre-built
/// BPF (a fixed stack `sock_fprog`).
unsafe !
/// Report the current errno to the error pipe and `_exit(127)` — async-signal-safe.
/// Diverges. SAFETY: callable only from the child window with a valid `error_fd`.
unsafe !
/// Close one inherited raw fd in the COORDINATOR via the raw `close` syscall. Used
/// to drop the coordinator's own copy of the error-pipe WRITE end after clone3 so
/// the read end can reach EOF. The raw syscall is used (NOT a `File` drop) because
/// std's owned-fd close path aborts the process if the fd is already closed, whereas
/// here a best-effort close is wanted (the child may or may not still share it).
pub
/// Set `FD_CLOEXEC` on an inherited raw fd in the COORDINATOR (parent, single-threaded,
/// pre-clone3). Used on the landlock ruleset fd(s) so a successful workload `fexecve`
/// auto-closes them (no ruleset fd leaks into the workload); the fd stays open across
/// the child's `restrict_self` because CLOEXEC only acts at exec, not before. A failure
/// is ignored — the ruleset is still applied; at worst the fd would leak (the scrub
/// already closes everything else, and the workload cannot misuse a ruleset fd with
/// `NO_NEW_PRIVS` already set).
pub
/// Create the parent→child user-namespace RENDEZVOUS sync pipe in the COORDINATOR
/// (single-threaded, pre-clone3) and return `(read_end, write_end)` as raw fds, both
/// `O_CLOEXEC`. The READ end is packed into the [`ChildExecPlan`] (the child blocks on
/// it post-clone3, inside its new userns); the WRITE end stays with the parent, which
/// writes one byte to RELEASE the child AFTER it has written the uid/gid maps. Both are
/// CLOEXEC so a successful workload `fexecve` cannot leak the pipe — the child reads
/// from the read end BEFORE exec (CLOEXEC acts only at exec), and the parent closes its
/// write end explicitly once the child is released or fail-closed.
///
/// Returned as plain `RawFd`s (NOT owned handles) so the coordinator can place the read
/// end into the inherited-fd-numbered plan and best-effort-close each end with the same
/// raw discipline as the rest of the launcher's inherited fds.
///
/// # Errors
/// An `io::Error` carrying the `pipe2` errno on failure (the userns launch then refuses
/// fail-closed — no child is created).
pub
/// The launcher's effective uid/gid, observed in the COORDINATOR (parent, pre-clone3),
/// for building the userns uid/gid maps (`0 <euid> 1` / `0 <egid> 1`): the child uid 0
/// maps to exactly the unprivileged identity the launcher already runs as.
pub
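// A minimal sketch of the single-entry map line described above. The three fields are,
// in order, the first id inside the namespace, the first id outside it, and the length of
// the range, so `0 <euid> 1` maps exactly one id. `id_map_line` is a hypothetical name;
// whether the launcher appends a trailing newline is not shown here, so this sketch
// emits none.

```rust
/// Hypothetical helper: format the one-entry uid or gid map that makes id 0 inside the
/// new user namespace correspond to `outer_id` outside it.
pub fn id_map_line(outer_id: u32) -> String {
    // inside-start = 0, outside-start = outer_id, range length = 1
    format!("0 {} 1", outer_id)
}
```

// The same helper serves both maps: pass the effective uid for the uid map and the
// effective gid for the gid map.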
/// The kernel-ABI `struct sock_fprog` (uapi `linux/filter.h`): a `{ len, filter }` pair
/// pointing at the BPF instruction stream `seccomp(SECCOMP_SET_MODE_FILTER, ..)` installs.
/// libc does not expose it and seccompiler keeps its own copy private, so the launcher
/// basement declares its own `#[repr(C)]` mirror (two plain fields — a `u16` count and a
/// pointer to the pre-built `sock_filter` slice). Built ON THE STACK in the child window
/// (no allocation) from the PARENT-built filter; the kernel `copy_from_user`s the program
/// during the syscall and leaves the memory untouched, so a borrowed pointer is sound.
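// A minimal sketch of the `#[repr(C)]` mirror described above and of building it from a
// borrowed slice with no allocation. `SockFilter`, `SockFprog` and `borrow_fprog` are
// illustrative names declared locally (the basement's own definitions are not shown
// here); the field layouts follow the kernel's `struct sock_filter` and
// `struct sock_fprog`.

```rust
/// Mirror of the kernel's `struct sock_filter`: one 8-byte BPF instruction.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct SockFilter {
    pub code: u16,
    pub jt: u8,
    pub jf: u8,
    pub k: u32,
}

/// Mirror of the kernel's `struct sock_fprog`: instruction count plus a pointer.
#[repr(C)]
pub struct SockFprog {
    pub len: u16,
    pub filter: *const SockFilter,
}

/// Hypothetical helper: build the stack descriptor over an already-assembled program.
/// Returns `None` if the program is empty or its length does not fit the `u16` count.
pub fn borrow_fprog(program: &[SockFilter]) -> Option<SockFprog> {
    if program.is_empty() || program.len() > u16::MAX as usize {
        return None;
    }
    Some(SockFprog {
        len: program.len() as u16,
        // Borrowed, not owned: the slice must stay alive until the syscall returns.
        filter: program.as_ptr(),
    })
}
```

// The helper only copies a length and a pointer, so it touches neither the heap nor any
// lock, which is what makes it usable from the child window.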
/// Set `PR_SET_NO_NEW_PRIVS` STANDALONE in the CHILD window (async-signal-safe, S10).
///
/// EXTRACTED from landlock's `restrict_self` (which sets NNP internally) so the seccomp
/// filter can be installed WITHOUT/BEFORE landlock and in the right order: NNP must be set
/// before any unprivileged seccomp filter (the kernel refuses `SECCOMP_SET_MODE_FILTER`
/// from an unprivileged caller otherwise). `prctl` is async-signal-safe and allocates
/// nothing; it is idempotent (landlock setting it again later is harmless). Returns `true`
/// on success; a non-zero `prctl` return ⇒ `false` (the caller fails closed before the
/// filter install, so the target never runs without NNP).
///
/// Install the PARENT-built seccomp BPF filter in the CHILD window via
/// `seccomp(SECCOMP_SET_MODE_FILTER, 0, &fprog)` (async-signal-safe, S10).
///
/// The `program` slice was assembled ENTIRELY in the parent (the bvisor seccomp model's
/// `compile()`); the child only READS it. A fixed `SockFprog` is built ON THE STACK
/// (no heap, no lock) pointing at that slice, and the raw `SYS_seccomp` syscall installs
/// it. The kernel `copy_from_user`s the program during the syscall, so the borrowed
/// pointer needs no ownership. PRECONDITION: `PR_SET_NO_NEW_PRIVS` is already set (see
/// [`set_no_new_privs`]) — call this LAST, after landlock, immediately before `fexecve`.
/// Returns `true` on a successful install; any non-zero return ⇒ `false` (the caller fails
/// closed so the target never runs without the filter).
///
/// Reap a child pid via `waitid(P_PID, …, WEXITED)` in the COORDINATOR (the parent,
/// single-threaded). Best-effort: the launch outcome is decided by the error pipe,
/// not the child's exit code; this only prevents a zombie. Errors are swallowed.
pub
/// Zero-initialise a `clone_args` without naming every per-arch field. A tiny
/// helper so the basement stays arch-portable.