//! Copyright 2026 0xClandestine, Ekryski, TheTom, Ambisphaeric
//! SPDX-License-Identifier: Apache-2.0
//!
//! Smoke kernel for the `simdgroup_load` HW intrinsic. It is the first
//! kernel to use the `Op::SimdgroupLoad` DSL primitive end-to-end, so
//! future kernels (qmm B-load fast path, etc.) have a working call
//! site to reference.
//!
//! What it does:
//! 1. A 32-lane simdgroup stages a flat 64-element input into TG
//!    memory (`tg_tile`, row-major 8×8, row stride = 8).
//! 2. `simdgroup_load(frag, "tg_tile", 0, 8)` issues the HW-fused
//!    coalesced load: one MSL `simdgroup_load(...)` instruction
//!    lands the 8×8 tile into the simdgroup-matrix fragment.
//! 3. A per-lane fragment scatter writes the frag back to `dst` in
//!    the A/C lane convention, so for `f32` / `f16` the values
//!    round-trip byte-exactly.
//!
//! No math, no MMA — this is a plumbing test. If the round-trip
//! preserves values bit-for-bit, the parser → IR → codegen chain for
//! `Op::SimdgroupLoad` is correctly hooked up and the produced MSL
//! issues a real `simdgroup_load(...)` call against threadgroup
//! memory.
//!
//! Lane → element mapping is the **A/C convention** used everywhere
//! else in the codebase (see `mt_mma_probe_a_identity_b_identity`):
//!
//! ```text
//!   qid = lane / 4
//!   fm  = (qid & 4) + ((lane / 2) % 4)       ∈ 0..8
//!   fn0 = (qid & 2) * 2 + (lane % 2) * 2     ∈ 0..8 (even)
//!   fn1 = fn0 + 1                            ∈ 0..8 (odd)
//!   frag.elem[0] at (fm, fn0)  ↔  tg_tile[fm*8 + fn0]
//!   frag.elem[1] at (fm, fn1)  ↔  tg_tile[fm*8 + fn1]
//! ```
//!
//! Dispatch: grid `[1, 1, 1]`, tpg `[32, 1, 1]` (one simdgroup).
//!
//! Sample MSL the codegen produces (look for these in
//! `cargo run -p metaltile-cli -- inspect mt_sgload_smoke`):
//!
//! ```text
//! threadgroup T tg_tile[64];
//! ...
//! simdgroup_matrix<T, 8, 8> frag;
//! simdgroup_load(frag, &tg_tile[0u], 8, ulong2(0, 0), false);
//! ...
//! ```
use kernel;
use KernelMode;
use crate::;
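
// Host-side sketch (illustration only, not part of the kernel or the DSL).
// It restates the A/C lane mapping from the module docs as plain Rust and
// emulates the three kernel steps on the CPU. Both function names below
// (`ac_lane_coords`, `emulate_sgload_round_trip`) are hypothetical and do
// not exist elsewhere in the codebase. The emulation applies the same
// mapping for the load and for the scatter, so it can only show that the
// documented formulas cover all 64 tile slots exactly once. It says nothing
// about the layout the hardware `simdgroup_load` actually produces.

/// Hypothetical helper: `(fm, fn0, fn1)` owned by `lane` (0..32) under the
/// A/C convention documented above.
#[allow(dead_code)]
fn ac_lane_coords(lane: u32) -> (u32, u32, u32) {
    let qid = lane / 4;
    let fm = (qid & 4) + ((lane / 2) % 4);
    let fn0 = (qid & 2) * 2 + (lane % 2) * 2;
    (fm, fn0, fn0 + 1)
}

/// Hypothetical CPU model of the kernel: stage, "load" into per-lane
/// fragment elements, then scatter back in the same lane convention.
#[allow(dead_code)]
fn emulate_sgload_round_trip(src: &[f32; 64]) -> [f32; 64] {
    // Step 1: stage the flat input into the (modelled) threadgroup tile.
    let tg_tile = *src;

    // Step 2: model the fragment as two elements per lane, 32 lanes.
    let mut frag = [[0.0f32; 2]; 32];
    for lane in 0..32u32 {
        let (fm, fn0, fn1) = ac_lane_coords(lane);
        frag[lane as usize] = [
            tg_tile[(fm * 8 + fn0) as usize],
            tg_tile[(fm * 8 + fn1) as usize],
        ];
    }

    // Step 3: scatter each lane's two elements back to `dst`. NaN marks
    // slots no lane wrote, so a gap in the mapping fails any comparison.
    let mut dst = [f32::NAN; 64];
    for lane in 0..32u32 {
        let (fm, fn0, fn1) = ac_lane_coords(lane);
        dst[(fm * 8 + fn0) as usize] = frag[lane as usize][0];
        dst[(fm * 8 + fn1) as usize] = frag[lane as usize][1];
    }
    dst
}
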
/// Round-trip an 8×8 tile through TG memory + a simdgroup-matrix
/// fragment via the `simdgroup_load` HW intrinsic. f32 / f16 should
/// produce byte-exact equality between `src` and `dst`.
///
/// Inputs:
/// - `src`: `[64]` flat row-major 8×8 source values
///
/// Outputs:
/// - `dst`: `[64]` flat row-major 8×8 destination, written from
///   the fragment in A/C lane convention
submit!