//! Expert-IO mode selection — covers both prefill and decode.
//!
//! On 2026-05-20 we tore out the `pread` expert-IO path across both
//! the prefill batched gather (see [`super::expert_io`] and
//! `pread_teardown_landed.md`) and the per-token decode dispatch
//! (see `completed_decode_gen_arc.md`), moving everything to direct
//! GPU reads of the mmap'd expert buffers. That was load-bearing for
//! Qwen3-A3B (+51 % decode on M2 Max) but turned out to break
//! Qwen3-A17B at both phases: its expert working set is many times
//! physical RAM, so the OS can't keep it page-resident and
//! `MTLResidencySet` either fails or thrashes. The GPU then stalls
//! on demand-fault VM activity.
//!
//! The fix is a single runtime gate consulted by every site that
//! used to inspect `ExpertIoMode`:
//!
//! - **`ExpertFiles::attach_to_device`**: skips the
//!   `newBufferWithBytesNoCopy` + residency-set pin in `Pread` mode.
//!   The on-disk mmap stays for reading via `read_at`; the layer
//!   files are not exposed to the GPU at all.
//! - **`MoeGraphScratch::expert_base`**: `Some` in `Pread` mode
//!   (a `num_experts * expert_size` staging buffer the prefill
//!   gather kernel reads from); `None` in `Mmap` mode.
//! - **`moe_block_forward`** (prefill): the `Pread` arm reads each
//!   bucketed expert from disk and uploads it into `expert_base`
//!   at `expert_id * expert_size`; the `Mmap` arm points the kernel
//!   at the layer mmap buffer.
//! - **`moe_dispatch_per_token`** (decode): the `Pread` arm runs the
//!   speculative-prefetch / sync-pread state machine from
//!   [`super::prefetch`]; the `Mmap` arm binds the layer mmap buffer.
//! - **`step_internal_per_token_oracle`**: skips the
//!   `prefetch.dispatch` fire in `Mmap` mode.
//!
//! The gate selects `Pread` when
//! `total_expert_bytes > 0.75 * physical_ram` and `Mmap` otherwise.
//! `MOEFLUX_EXPERT_IO=mmap|pread|auto` overrides it for
//! benchmarking; `auto` (the default) runs the gate.
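//!
//! A minimal sketch of the gate and override described above, not
//! the crate's implementation. The names `gate` and `pick_mode` are
//! hypothetical, the enum is a stand-in for `ExpertIoMode`, and the
//! caller passes the RAM size in rather than querying `sysinfo`.
//! Treating an unrecognised override value like `auto` is also an
//! assumption.
//!
//! ```rust
//! /// Stand-in for the crate's `ExpertIoMode` (assumed two variants).
//! #[derive(Debug, Clone, Copy, PartialEq, Eq)]
//! pub enum ExpertIoMode {
//!     Mmap,
//!     Pread,
//! }
//!
//! /// Size gate: `Pread` when the expert working set exceeds 75 % of
//! /// physical RAM, `Mmap` otherwise.
//! pub fn gate(total_expert_bytes: u64, physical_ram: u64) -> ExpertIoMode {
//!     // Integer form of `total > 0.75 * ram`; widening to u128
//!     // keeps the multiplications from overflowing.
//!     if (total_expert_bytes as u128) * 4 > (physical_ram as u128) * 3 {
//!         ExpertIoMode::Pread
//!     } else {
//!         ExpertIoMode::Mmap
//!     }
//! }
//!
//! /// Apply an optional override (the value of `MOEFLUX_EXPERT_IO`).
//! /// `auto`, an unset variable, or any other value (assumption)
//! /// falls through to the gate.
//! pub fn pick_mode(
//!     override_value: Option<&str>,
//!     total_expert_bytes: u64,
//!     physical_ram: u64,
//! ) -> ExpertIoMode {
//!     match override_value.map(str::trim) {
//!         Some("mmap") => ExpertIoMode::Mmap,
//!         Some("pread") => ExpertIoMode::Pread,
//!         _ => gate(total_expert_bytes, physical_ram),
//!     }
//! }
//! ```
//!
//! A caller could feed it the environment with
//! `pick_mode(std::env::var("MOEFLUX_EXPERT_IO").ok().as_deref(), total, ram)`.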
use crate;
/// Which path moves expert bytes onto the GPU. Picked once at
/// [`crate::riir::RsCtx::open`] and threaded into the prefetch state
/// machine, the MoE graph scratch, and `ExpertFiles`. The choice is
/// constant for the session.

/// Total bytes of expert weights across every layer in the active
/// variant's 4-bit layout. This is the working set
/// `ExpertFiles::attach_to_device` would mmap into Metal-resident
/// buffers when every layer file is present, and is what the gate
/// compares against physical RAM.

/// Total physical RAM in bytes via the [`sysinfo`] crate. Sampled
/// once at `open()`; the value is constant for the lifetime of the
/// session.

/// Pick the expert-IO mode for this run. Reads `MOEFLUX_EXPERT_IO`
/// if set (`mmap`, `pread`, or `auto`); otherwise (or on `auto`)
/// applies the `total_expert_bytes > 0.75 * physical_ram` gate. Logs
/// the decision once to stderr.