//! Copyright 2026 0xClandestine, Ekryski, TheTom, Ambisphaeric
//! SPDX-License-Identifier: Apache-2.0
use ;
/// Auto-select the best SDPA-prefill MMA kernel for the given dtype + GPU
/// family. Returns the kernel IR ready to dispatch.
///
/// Heuristic:
/// - bf16 + Apple gen-8 (M2): use `mt_sdpa_prefill_mma_bf16`, the single-Q
///   dd-loop variant. It reduces the simdgroup-matrix frag count from 22 to 7,
///   freeing register-file room for M2's emulated bf16-MMA path: +14pts vs
///   the 16-Q-preload sibling at bf16 on M2.
/// - bf16 + Apple gen-9+ (M3+): use `mt_sdpa_prefill_mma`, the 16-Q-preload
///   sibling. Both variants tie on bf16 on M5 (native bf16 MMA, no emulation
///   tax), but `mt_sdpa_prefill_mma` wins f32/f16 by 1pt on idle, so we
///   stick with it.
/// - f32 / f16 (any family): use `mt_sdpa_prefill_mma`.
///
/// `family` should be the `Context::chip_family()` value. `None` means
/// "unknown / non-Apple-Silicon target"; in that case the selector falls
/// back to `mt_sdpa_prefill_mma`, which has the broadest perf profile.
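///
/// A minimal sketch of the dispatch rule described above, assuming `family`
/// arrives as an `Option<u32>` generation number and that the bf16 cutoff is
/// `family ≤ 8`. `DType`, `PrefillKernel`, and `select_prefill_kernel` are
/// hypothetical names used for illustration only, not this crate's API; the
/// real selector returns kernel IR rather than an enum tag.
///
/// ```rust
/// /// Element type of the attention inputs (hypothetical stand-in).
/// #[derive(Debug, Clone, Copy, PartialEq, Eq)]
/// enum DType {
///     F32,
///     F16,
///     Bf16,
/// }
///
/// /// Which prefill kernel the heuristic picks (hypothetical stand-in).
/// #[derive(Debug, Clone, Copy, PartialEq, Eq)]
/// enum PrefillKernel {
///     /// Stands for `mt_sdpa_prefill_mma`, the 16-Q-preload variant.
///     Mma,
///     /// Stands for `mt_sdpa_prefill_mma_bf16`, the single-Q dd-loop variant.
///     MmaBf16,
/// }
///
/// /// Assumed rule: only bf16 on a known family at or below 8 takes the
/// /// bf16-specific kernel; everything else, including `None`, takes `Mma`.
/// fn select_prefill_kernel(dtype: DType, family: Option<u32>) -> PrefillKernel {
///     match (dtype, family) {
///         (DType::Bf16, Some(fam)) if fam <= 8 => PrefillKernel::MmaBf16,
///         _ => PrefillKernel::Mma,
///     }
/// }
/// ```
///
/// Under this rule `(Bf16, Some(8))` maps to `MmaBf16`, while `(Bf16, None)`
/// and every f32 / f16 input map to `Mma`.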
///
/// Composite numbers via this selector — **median of 5 reruns, clean
/// shell sessions, M2 mini canonical per `feedback_metaltile_bench_on_m2_mini`**:
///
/// | Machine | dtype | Selected | kv_ld=132 | kv_ld=136 | Δ |
/// |---------|-------|----------|----------:|----------:|---:|
/// | M2 mini | f32 | mma | 124% | **127%** | +3 |
/// | M2 mini | f16 | mma | 92% | **96%** | +4 |
/// | M2 mini | bf16 | mma_bf16 | 99% | (n/a) | — |
/// | M5 Max | f32 | mma | 114% | **116%** | +2 |
/// | M5 Max | f16 | mma | 107% | 107% | 0 |
/// | M5 Max | bf16 | mma | 106% | 107% | +1* |
///
/// \* M5 f16 / bf16 deltas are within the 0.9-3.7% noise envelope, so they are
/// effectively a wash. The real wins are **M2 f16 (+4pt)** and **M2/M5 f32 (+2-3pt)**.
/// For M2 f16, the max under kv_ld=132 was 95 and the min under kv_ld=136 was 95, so
/// the ranges just touch, but the medians (96 vs 92) separate cleanly. The original
/// single-shot bench claimed 99%; that was a best-case run, not the median. The
/// direction (+4pt) holds; the absolute is closer to 96%.
///
/// The `mma_bf16` sibling kept kv_ld=132. Agent B's clean median-of-5
/// sweep found no kv_ld=136 win on `mma_bf16` larger than noise on
/// either rig — the bank-pattern split (4-byte f16 wants +8, 8-byte
/// bf16 wants +4) holds up.
///
/// # Untested hardware
///
/// Heuristic was validated on M2 mini (Apple8/gen-8) and M5 Max
/// (Apple10/gen-17+). The other Apple GPU families are inferred:
///
/// - **M1 (Apple7/gen-7)**: same architectural class as M2 (no native bf16
///   MMA, emulates via fp32). Selector routes bf16 → `mma_bf16` here too,
///   which *should* be the right call but is not measured. If perf is
///   off, suspect the kv_ld=132 bank-skew pad (M1 has different TG memory
///   bank geometry) or barrier density.
/// - **M3 / M4 (Apple9/gen-17)**: native bf16 MMA hardware. Selector
///   routes bf16 → `mma` (16-Q-preload sibling), inferred by analogy to
///   M5. Worth confirming `mma` wins on these too; if `mma_bf16` wins
///   instead, the `family ≤ 8` cutoff should be raised to cover them.
/// - **A17/A18 mobile GPUs** (gen-17, gen-18): same family as M3/M4 on
///   paper, but TG memory limits and L1 sizes differ; unmeasured.
///
/// Track results in PR notes or a follow-up; nudge the cutoff if M1
/// bf16 regresses or if M3/M4 bf16 prefers `mma_bf16`.