# archmage

> Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.

## CRITICAL: Token/Trait Design (DO NOT MODIFY)

### LLVM x86-64 Microarchitecture Levels

| Level | Features | Token | Trait |
|-------|----------|-------|-------|
| **v1** | SSE, SSE2 (baseline) | None | None (always available) |
| **v2** | + SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT | `X64V2Token` | `HasX64V2` |
| **v3** | + AVX, AVX2, FMA, BMI1, BMI2, F16C | `X64V3Token` / `Desktop64` | Use token directly |
| **v4** | + AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL | `X64V4Token` / `Avx512Token` | `HasX64V4` |
| **Modern** | + VPOPCNTDQ, IFMA, VBMI, VNNI, BF16, VBMI2, BITALG, VPCLMULQDQ, GFNI, VAES | `Avx512ModernToken` | Use token directly |
| **FP16** | AVX512FP16 (independent) | `Avx512Fp16Token` | Use token directly |

### AArch64 Tokens

| Token | Features | Trait |
|-------|----------|-------|
| `NeonToken` / `Arm64` | neon (always available) | `HasNeon` (baseline) |
| `NeonAesToken` | + aes | `HasNeonAes` |
| `NeonSha3Token` | + sha3 | `HasNeonSha3` |
| `NeonCrcToken` | + crc | Use token directly |

**PROHIBITED:** NO SVE/SVE2 - Rust stable doesn't support SVE intrinsics yet.

### Rules

1. **NO granular x86 traits** - No `HasSse`, `HasSse2`, `HasAvx`, `HasAvx2`, `HasFma`, `HasAvx512f`, `HasAvx512bw`, etc.
2. **Use tier tokens** - `X64V2Token`, `X64V3Token`, `X64V4Token`, `Avx512ModernToken`
3. **Single trait per tier** - `HasX64V2`, `HasX64V4` only
4. **NEON includes fp16** - They always appear together on AArch64
5. **NO SVE** - `SveToken`, `Sve2Token`, `HasSve`, `HasSve2` are PROHIBITED (Rust stable lacks SVE support)
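
A quick illustration of rules 1-3 (the kernel names are placeholders; `HasX64V2` and `X64V3Token` are assumed to be re-exported from the crate root):

```rust
use archmage::{HasX64V2, X64V3Token};

// WRONG - granular feature traits like `HasAvx2` or `HasFma` do not exist here.
// fn kernel(token: impl HasAvx2 + HasFma, data: &mut [f32]) { ... }

// CORRECT - take a tier token directly...
fn kernel_v3(_token: X64V3Token, data: &mut [f32]) {
    for x in data.iter_mut() { *x *= 2.0; }
}

// ...or a tier trait bound when any v2-or-higher token should be accepted.
fn kernel_v2(_token: impl HasX64V2, data: &mut [f32]) {
    for x in data.iter_mut() { *x += 1.0; }
}
```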

---

## CRITICAL: Documentation Examples

### Always prefer `#[arcane]` over manual `#[target_feature]`

**DO NOT write examples with manual `#[target_feature]` + unsafe wrappers.** The `#[arcane]` macro does this automatically and is the correct pattern for archmage.

```rust
// WRONG - manual #[target_feature] wrapping
#[cfg(target_arch = "x86_64")]
#[inline]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn process_inner(data: &[f32]) -> f32 { ... }

#[cfg(target_arch = "x86_64")]
fn process(token: X64V3Token, data: &[f32]) -> f32 {
    unsafe { process_inner(data) }
}

// CORRECT - use #[arcane] (it generates the above automatically)
#[cfg(target_arch = "x86_64")]
#[arcane]
fn process(token: X64V3Token, data: &[f32]) -> f32 {
    // This function body is compiled with #[target_feature(enable = "avx2,fma")]
    // Intrinsics and operators inline properly into single SIMD instructions
    ...
}
```

### Use `safe_unaligned_simd` inside `#[arcane]` functions

**Memory operations should go through `safe_unaligned_simd` rather than raw-pointer intrinsics.** Inside `#[arcane]` those calls are safe because the enabled target features match.

```rust
// WRONG - raw pointers need unsafe
let v = unsafe { _mm256_loadu_ps(data.as_ptr()) };

// CORRECT - use safe_unaligned_simd (safe inside #[arcane])
let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);
```

## Quick Start

```bash
cargo test                    # Run tests
cargo test --all-features     # Test with all integrations
cargo clippy --all-features   # Lint
just generate                 # Regenerate all generated code
just validate-registry        # Validate token-registry.toml
just validate-tokens          # Validate magetypes safety + try_new() checks
just parity                   # Check API parity across x86/ARM/WASM
just soundness                # Static intrinsic soundness verification
just miri                     # Run magetypes under Miri (detects UB)
just audit                    # Scan for safety-critical code
just intrinsics-refresh       # Re-extract intrinsics from current Rust
just ci                       # Run ALL checks (must pass before push/publish)
```

## CI and Publishing Rules

**ABSOLUTE REQUIREMENT: Run `just ci` (or `just all` or `cargo xtask all`) before ANY push or publish.**

```bash
just ci    # or: just all, cargo xtask ci, cargo xtask all
```

**NEVER run `git push` or `cargo publish` until this passes. No exceptions.**

CI checks (all must pass):
1. `cargo xtask generate` — regenerate all code
2. **Clean worktree check** — no uncommitted changes after generation (HARD FAIL)
3. `cargo xtask validate` — intrinsic safety + try_new() feature verification
4. `cargo xtask parity` — parity check (0 issues remaining)
5. `cargo clippy --features "std macros bytemuck avx512"` — zero warnings
6. `cargo test --features "std macros bytemuck avx512"` — all tests pass
7. `cargo fmt --check` — code is formatted

**Note:** Parity check reports 0 issues. All W128 types have identical APIs across x86/ARM/WASM.

If ANY check fails:
- Do NOT push
- Do NOT publish
- Fix the issue first
- Re-run `just ci` until it passes

## Source of Truth: token-registry.toml

All token definitions, feature sets, trait mappings, and width configurations
live in `token-registry.toml`. Everything else is derived:

- `src/tokens/generated/` — token structs, traits, stubs, generated by xtask
- `archmage-macros/src/generated/` — macro registry, generated by xtask
- `magetypes/src/simd/generated/` — SIMD types, generated by xtask
- `docs/generated/` — intrinsics reference docs, generated by xtask
- `xtask/src/main.rs` validation — reads registry at runtime
- `cargo xtask validate` — verifies try_new() checks match registry
- `cargo xtask parity` — checks API parity across architectures

To add/modify tokens: edit `token-registry.toml`, then `just generate`.

## Core Insight: Rust 1.85+ Changed Everything

As of Rust 1.85, **value-based intrinsics are safe inside `#[target_feature]` functions**:

```rust
#[target_feature(enable = "avx2")]
unsafe fn example() {
    let a = _mm256_setzero_ps();           // SAFE!
    let b = _mm256_add_ps(a, a);           // SAFE!
    let c = _mm256_fmadd_ps(a, a, a);      // SAFE!

    // Only memory ops remain unsafe (raw pointers)
    let v = unsafe { _mm256_loadu_ps(ptr) };  // Still needs unsafe
}
```

This means we **don't need to wrap** arithmetic, shuffle, compare, bitwise, or other value-based intrinsics. Only three pieces are needed:
1. **Tokens** - Prove CPU features are available
2. **`#[arcane]` macro** - Enable `#[target_feature]` via token proof
3. **`safe_unaligned_simd`** - Reference-based memory operations (user adds as dependency)

## How `#[arcane]` Works

The macro generates an inner function with `#[target_feature]`:

```rust
// You write:
#[arcane]
fn my_kernel(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    let v = _mm256_setzero_ps();  // Safe!
    // ...
}

// Macro generates:
fn my_kernel(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    #[target_feature(enable = "avx2,fma")]
    unsafe fn inner(data: &[f32; 8]) -> [f32; 8] {
        let v = _mm256_setzero_ps();  // Safe inside #[target_feature]!
        // ...
    }
    // SAFETY: Token proves CPU support was verified via try_new()
    unsafe { inner(data) }
}
```

## Friendly Aliases

| Alias | Token | What it means |
|-------|-------|---------------|
| `Desktop64` | `X64V3Token` | AVX2 + FMA (Haswell 2013+, Zen 1+) |
| `Server64` | `X64V4Token` | + AVX-512 (Xeon 2017+, Zen 4+) |
| `Arm64` | `NeonToken` | NEON + FP16 (all 64-bit ARM) |

```rust
use archmage::{Desktop64, SimdToken, arcane};

#[arcane]
fn process(token: Desktop64, data: &mut [f32; 8]) {
    // AVX2 + FMA intrinsics safe here
}

if let Some(token) = Desktop64::summon() {
    process(token, &mut data);
}
```

## Directory Structure

```
token-registry.toml          # THE source of truth for all token/trait/feature data
spec.md                      # Architecture spec and safety model documentation
archmage/                    # Main crate: tokens, macros, detect
├── src/
│   ├── lib.rs              # Main exports
│   ├── tokens/             # SIMD capability tokens
│   │   ├── mod.rs          # SimdToken trait definition only
│   │   └── generated/      # Generated from token-registry.toml
│   │       ├── mod.rs      # cfg-gated module routing + re-exports
│   │       ├── traits.rs   # Marker traits (Has128BitSimd, HasX64V2, etc.)
│   │       ├── x86.rs      # x86 tokens (v2, v3) + detection
│   │       ├── x86_avx512.rs  # AVX-512 tokens (v4, modern, fp16)
│   │       ├── arm.rs      # ARM tokens + detection
│   │       ├── wasm.rs     # WASM tokens + detection
│   │       ├── x86_stubs.rs   # x86 stubs (try_new → None)
│   │       ├── arm_stubs.rs   # ARM stubs
│   │       └── wasm_stubs.rs  # WASM stubs
archmage-macros/             # Proc-macro crate (#[arcane], #[multiwidth])
└── src/
    ├── lib.rs              # Macro implementation
    └── generated/          # Generated from token-registry.toml
        ├── mod.rs          # Re-exports
        └── registry.rs     # Token→features mappings
magetypes/                   # SIMD types crate (depends on archmage)
├── src/
│   ├── lib.rs              # Exports simd module
│   └── simd/
│       ├── mod.rs          # Re-exports from generated/
│       └── generated/      # Auto-generated SIMD types
│           ├── x86/        # x86-64 types (w128, w256, w512)
│           ├── arm/        # AArch64 types (w128)
│           ├── wasm/       # WASM types (w128)
│           └── polyfill.rs # Width emulation
docs/
└── generated/              # Auto-generated reference docs
    ├── x86-intrinsics-by-token.md
    ├── aarch64-intrinsics-by-token.md
    └── memory-ops-reference.md
xtask/                       # Code generator and validation
└── src/
    ├── main.rs             # Generates everything, validates safety, parity check
    ├── registry.rs         # token-registry.toml parser
    └── token_gen.rs        # Token/trait code generator
```

## CRITICAL: Codegen Style Rules

**NEVER use `writeln!` chains or `write!` chains for code generation.** Use `r#"..."#` raw strings with `formatdoc!` (from the `indoc` crate) instead:

```rust
// WRONG - verbose, hard to read, easy to get wrong
writeln!(code, "/// {doc}").unwrap();
writeln!(code, "pub fn {name}(self) -> Self {{").unwrap();
writeln!(code, "    Self({body})").unwrap();
writeln!(code, "}}").unwrap();

// CORRECT - use formatdoc! with raw strings
use indoc::formatdoc;
code.push_str(&formatdoc! {r#"
    /// {doc}
    pub fn {name}(self) -> Self {{
        Self({body})
    }}
"#});
```

To generate methods, use the helpers in `xtask/src/simd_types/types.rs`:

```rust
use super::types::{gen_unary_method, gen_binary_method, gen_scalar_method};

code.push_str(&gen_unary_method("Compute absolute value", "abs", "Self(_mm256_abs_epi32(self.0))"));
code.push_str(&gen_binary_method("Add two vectors", "add", "Self(_mm256_add_epi32(self.0, other.0))"));
code.push_str(&gen_scalar_method("Extract first element", "first", "i32", "_mm_cvtsi128_si32(self.0)"));
```

## Token Hierarchy

**x86:**
- `X64V2Token` - SSE4.2 + POPCNT (Nehalem 2008+)
- `X64V3Token` / `Desktop64` - AVX2 + FMA + BMI2 (Haswell 2013+, Zen 1+)
- `X64V4Token` / `Avx512Token` - + AVX-512 F/BW/CD/DQ/VL (Skylake-X 2017+, Zen 4+)
- `Avx512ModernToken` - + modern extensions (Ice Lake 2019+, Zen 4+)
- `Avx512Fp16Token` - + FP16 (Sapphire Rapids 2023+)

**ARM:**
- `NeonToken` / `Arm64` - NEON (baseline, always available)
- `NeonAesToken` - + AES
- `NeonSha3Token` - + SHA3
- `NeonCrcToken` - + CRC
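
The hierarchy is typically walked top-down at runtime: summon the highest tier you have a kernel for and fall back. A minimal dispatch sketch (the kernel bodies are placeholders, not real SIMD code):

```rust
use archmage::{arcane, SimdToken, X64V3Token, X64V4Token};

#[cfg(target_arch = "x86_64")]
#[arcane]
fn sum_v4(_token: X64V4Token, data: &[f32]) -> f32 {
    // Placeholder body; compiled with the full v4 (AVX-512) feature set.
    data.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[arcane]
fn sum_v3(_token: X64V3Token, data: &[f32]) -> f32 {
    // Placeholder body; compiled with the v3 (AVX2 + FMA) feature set.
    data.iter().sum()
}

#[cfg(target_arch = "x86_64")]
fn sum(data: &[f32]) -> f32 {
    if let Some(t) = X64V4Token::summon() {
        sum_v4(t, data)
    } else if let Some(t) = X64V3Token::summon() {
        sum_v3(t, data)
    } else {
        data.iter().sum() // scalar fallback (x86-64-v1 baseline)
    }
}
```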

## Tier Traits

Only two x86 tier traits exist for generic bounds (`HasX64V2`, `HasX64V4`), plus `HasNeon` on AArch64:

```rust
fn requires_v2(token: impl HasX64V2) { ... }
fn requires_v4(token: impl HasX64V4) { ... }
fn requires_neon(token: impl HasNeon) { ... }
```

For v3 (AVX2+FMA), use `X64V3Token` directly - it's the recommended baseline.

## SIMD Types (magetypes crate)

Token-gated SIMD types live in the **magetypes** crate:

```rust
use archmage::{X64V3Token, SimdToken};
use magetypes::simd::f32x8;

if let Some(token) = X64V3Token::summon() {
    let a = f32x8::splat(token, 1.0);
    let b = f32x8::splat(token, 2.0);
    let c = a + b;  // Natural operators!
}
```

For multiwidth code, use `magetypes::simd::*`:

```rust
use archmage::multiwidth;

#[multiwidth]
mod kernels {
    use magetypes::simd::*;

    pub fn sum(token: Token, data: &[f32]) -> f32 {
        let mut acc = f32xN::zero(token);
        // ...
    }
}
```

## Safe Memory Operations

Use `safe_unaligned_simd` directly inside `#[arcane]` functions:

```rust
use archmage::{Desktop64, SimdToken, arcane};

#[arcane]
fn process(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // safe_unaligned_simd calls are SAFE inside #[arcane]
    let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);
    let squared = _mm256_mul_ps(v, v);
    let mut out = [0.0f32; 8];
    safe_unaligned_simd::x86_64::_mm256_storeu_ps(&mut out, squared);
    out
}
```

## Pending Work

### API Parity Status (0 issues — complete!)

**Current state:** All W128 types have identical APIs across x86/ARM/WASM. Reduced from 270 → 0 parity issues (100%).

Run `cargo xtask parity` to verify.

### Known Cross-Architecture Behavioral Differences

These are documented semantic differences between architectures. Tests must account for them; they are not bugs to fix.

| Issue | x86 | ARM | WASM | Workaround |
|-------|-----|-----|------|------------|
| Bitwise operators (`&`, `\|`, `^`) on integers | Trait impls (operators work) | Methods only | Methods only | Use `.and()`, `.or()`, `.xor()` methods |
| `shr` for signed integers | Logical (zero-fill) | Arithmetic (sign-extend) | Arithmetic (sign-extend) | Use `shr_arithmetic` for portable sign-extending shift |
| `blend` signature | `(mask, true, false)` | `(mask, true, false)` | `(self, other, mask)` | Avoid in portable code; use bitcast + comparison verification |
| `interleave_lo/hi` | f32x4 only | f32x4 only | f32x4 only | Only use on f32x4, not integer types |
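
In code that compiles for all three backends, the safest shape is to stick to the method forms from the workaround column. A rough sketch, assuming `.and()` and `.shr_arithmetic(n)` take exactly these forms (details may differ):

```rust
use magetypes::simd::i32x4;

// Takes already-constructed vectors so it is token-agnostic.
fn mask_and_shift(a: i32x4, m: i32x4) -> (i32x4, i32x4) {
    let masked = a.and(m);             // not `a & m`: operator impls are x86-only
    let shifted = a.shr_arithmetic(1); // sign-extending shift on every architecture
    (masked, shifted)
}
```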

### Long-Term

- **Generator test fixtures**: Add example input/expected output pairs to each xtask generator (SIMD types, width dispatch, tokens, macro registry). These serve as both documentation of expected output and cross-platform regression tests — run on x86, ARM, and WASM to catch codegen divergence.

### Completed

- ~~**WASM u64x2 ordering comparisons**~~: Done. Added simd_lt/le/gt/ge via bias-to-signed polyfill (XOR with i64::MIN, then i64x2_lt/gt). Parity: 4 → 0.
- ~~**x86 byte shift polyfills**~~: Done. Added i8x16/u8x16 shl, shr, shr_arithmetic for all x86 widths. Uses 16-bit shift + byte mask (~2 instructions). AVX-512 shr_arithmetic uses mask registers. Parity: 9 → 4.
- ~~**All actionable parity issues**~~: Done. Closed 28 remaining issues: extend/pack ops (17), RGBA pixel ops (4), i64/u64 polyfill math (7). Parity: 37 → 9 (0 actionable).
- ~~**ARM/WASM block ops**~~: Done. ARM uses native vzip1q/vzip2q, WASM uses i32x4_shuffle. Both gained interleave_lo/hi, interleave, deinterleave_4ch, interleave_4ch, transpose_4x4, transpose_4x4_copy. Parity: 47 → 37.
- ~~**WASM cbrt + f64x2 log10_lowp**~~: Done. WASM f32x4 gained cbrt_midp/cbrt_midp_precise (scalar initial guess + Newton-Raphson). WASM f64x2 gained log10_lowp via scalar fallback.
- ~~**ARM transcendentals + x86 missing variants**~~: Done. ARM f32x4 has full lowp+midp transcendentals (log2, exp2, ln, exp, log10, pow, cbrt) with all variant coverage. ARM f64x2 has lowp transcendentals via scalar fallback. x86 gained lowp _unchecked aliases, midp _precise variants, and log10_midp family. Parity: 80 → 47.
- ~~**API surface parity detection tool**~~: Done. Use `cargo xtask parity` to detect API variances between x86/ARM/WASM.
- ~~**Move generated files to subfolder**~~: Done. All generated code now lives in `generated/` subfolders.
- ~~**Merge WASM transcendentals from `feat/wasm128`**~~: Done (354dc2b). All `_unchecked` and `_precise` variants now generated.
- ~~**ARM comparison ops**~~: Done. Added simd_eq, simd_ne, simd_lt, simd_le, simd_gt, simd_ge, blend.
- ~~**ARM bitwise ops**~~: Done. Added not, shl, shr for all integer types.
- ~~**ARM boolean reductions**~~: Done. Added all_true, any_true, bitmask for all integer types.
- ~~**x86 boolean reductions**~~: Done. Added all_true, any_true, bitmask for all integer types (128/256/512-bit).
- ~~**WASM bytemuck methods**~~: Done. Added cast_slice, cast_slice_mut, as_bytes, as_bytes_mut, from_bytes, from_bytes_owned.
- ~~**ARM reduce_add for unsigned**~~: Done. Extended reduce_add to all integer types including unsigned.
- ~~**Approximations (rcp, rsqrt) for ARM/WASM**~~: Done. ARM uses native vrecpe/vrsqrte, WASM uses division.
- ~~**mul_sub for ARM/WASM**~~: Done. ARM uses vfma with negation, WASM uses mul+sub.
- ~~**Type conversions for ARM/WASM**~~: Done. Added to_i32x4, to_i32x4_round, from_i32x4, to_f32x4, to_i32x4_low.
- ~~**shr_arithmetic for ARM/WASM**~~: Done. Added for i8x16, i16x8, i32x4.

## Suboptimal Intrinsics (needs faster-path overloads)

Track places where we use polyfills or slower instruction sequences because the base token lacks a native intrinsic, but a higher token would have one. Each entry should get a method overload that accepts the higher token for the fast path.

| Method | Token (slow) | Polyfill | Token (fast) | Native Intrinsic | Status |
|--------|-------------|----------|-------------|------------------|--------|
| f32 cbrt initial guess | all tokens | scalar extract + bit hack | (none) | No SIMD cbrt exists; consider SIMD bit hack via integer ops | Low priority |

**Rules for this section:**
- Only add entries when you've verified the faster intrinsic exists and is correct.
- The overload should take the higher token as a parameter (e.g., `fn min_fast(self, other: Self, _: X64V4Token) -> Self`).
- Or use trait bounds: `fn min<T: HasX64V4>(self, other: Self, _: T) -> Self` for the fast path.
- Remove entries when the fast-path overload is implemented.

### Completed fast-path overloads

All i64/u64 min/max/abs now have `_fast` variants that take `X64V4Token`:
- `i64x2::min_fast`, `max_fast`, `abs_fast`
- `u64x2::min_fast`, `max_fast`
- `i64x4::min_fast`, `max_fast`, `abs_fast`
- `u64x4::min_fast`, `max_fast`
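
A sketch of how a caller picks the fast path (assumes `i64x2::splat(token, v)` and a polyfilled `min` exist alongside the `_fast` variants listed above):

```rust
use archmage::{SimdToken, X64V3Token, X64V4Token};
use magetypes::simd::i64x2;

fn min_example() {
    if let Some(v4) = X64V4Token::summon() {
        let a = i64x2::splat(v4, -5);
        let b = i64x2::splat(v4, 3);
        let _fast = a.min_fast(b, v4); // native AVX-512 VPMINSQ
    } else if let Some(v3) = X64V3Token::summon() {
        let a = i64x2::splat(v3, -5);
        let b = i64x2::splat(v3, 3);
        let _slow = a.min(b); // polyfilled compare-and-blend on v2/v3
    }
}
```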

## License

MIT OR Apache-2.0