# archmage

> Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.

## CRITICAL: Naming Conventions

**Use the thematic names, not the boring ones:**

| ❌ Don't use | ✅ Use instead | Notes |
|-------------|----------------|-------|
| `#[simd_fn]` | `#[arcane]` | `simd_fn` exists only for migration |
| `try_new()` | `summon()` | `try_new` exists only for migration |

**We are mages, not bureaucrats.** Write `Token::summon()`, not `Token::try_new()`.

## Reference: CPU Features, Detection, and Dispatch

### The Core Distinction: Compile-Time vs Runtime

| Mechanism | When | Effect |
|-----------|------|--------|
| `#[cfg(target_arch = "...")]` | Compile | Include/exclude code from binary |
| `#[cfg(target_feature = "...")]` | Compile | True only if feature is in target spec |
| `#[cfg(feature = "...")]` | Compile | Cargo feature flag |
| `-Ctarget-cpu=native` | Compile | LLVM assumes current CPU's features |
| `is_x86_feature_detected!()` | Runtime | CPUID instruction |
| `Token::summon()` | Runtime | Archmage's detection (compiles away when guaranteed) |

**Tokens exist everywhere.** `Desktop64`, `Arm64`, etc. compile on all platforms—`summon()` just returns `None` on unsupported architectures. This means **you rarely need `#[cfg(target_arch)]` guards** in user code. The stubs handle cross-compilation cleanly.
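
A minimal sketch of what that looks like in practice — `sum_v3`/`sum_neon` are hypothetical `#[arcane]` helpers, but the tokens and `summon()` are the real API:

```rust
use archmage::{Arm64, Desktop64, SimdToken};

// Compiles unchanged on x86_64, AArch64, and WASM: summon() simply
// returns None for tokens that can't exist on the current target.
pub fn sum(data: &[f32]) -> f32 {
    if let Some(token) = Desktop64::summon() {
        return sum_v3(token, data);   // hypothetical #[arcane] helper
    }
    if let Some(token) = Arm64::summon() {
        return sum_neon(token, data); // hypothetical #[arcane] helper
    }
    data.iter().sum()                 // scalar fallback
}
```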

### CRITICAL: Target-Feature Boundaries (4x Performance Impact)

**Enter `#[arcane]` once at the top, use `#[rite]` for everything inside.**

LLVM cannot inline across mismatched `#[target_feature]` attributes. Each `#[arcane]` call from non-SIMD code creates an optimization boundary — LLVM can't hoist loads, sink stores, or vectorize across it. This costs 4-6x depending on workload (see `benches/asm_inspection.rs` and `docs/PERFORMANCE.md`). Token hoisting doesn't help — even with the token pre-summoned, calling `#[arcane]` per iteration still hits the boundary.

```rust
// WRONG: #[arcane] boundary every iteration (4x slower)
#[arcane]
fn dist_simd(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> f32 { ... }

fn process_all(points: &[[f32; 8]]) {
    let token = X64V3Token::summon().unwrap(); // hoisted — doesn't help!
    for i in 0..points.len() {
        for j in i+1..points.len() {
            dist_simd(token, &points[i], &points[j]); // boundary per call
        }
    }
}
```

```rust
// RIGHT: one #[arcane] entry, #[rite] helpers inline freely
fn process_all(points: &[[f32; 8]]) {
    if let Some(token) = X64V3Token::summon() {
        process_all_simd(token, points);
    } else {
        process_all_scalar(points);
    }
}

#[arcane]
fn process_all_simd(token: X64V3Token, points: &[[f32; 8]]) {
    for i in 0..points.len() {
        for j in i+1..points.len() {
            dist_simd(token, &points[i], &points[j]); // #[rite] inlines here
        }
    }
}

#[rite]
fn dist_simd(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    // Inlines into process_all_simd — same LLVM optimization region
    ...
}
```

**The rule:** `#[arcane]` at the entry point, `#[rite]` for everything called from SIMD code.

### CRITICAL: Generic Bounds Are Optimization Barriers

**Generic passthrough with trait bounds breaks inlining.** The compiler cannot inline across generic boundaries — each trait-bounded call is a potential indirect call.

```rust
// BAD: Generic bound prevents inlining into caller
#[arcane]
fn process_generic<T: HasX64V2>(token: T, data: &[f32]) -> f32 {
    inner_work(token, data)  // Can't inline — T could be any type
}

#[arcane]
fn inner_work<T: HasX64V2>(token: T, data: &[f32]) -> f32 {
    // Even with #[inline(always)], this may not inline through generic
    ...
}
```

```rust
// GOOD: Concrete token enables full inlining
#[arcane]
fn process_concrete(token: X64V3Token, data: &[f32]) -> f32 {
    inner_work(token, data)  // Fully inlinable — concrete type
}

#[arcane]
fn inner_work(token: X64V3Token, data: &[f32]) -> f32 {
    // Inlines into caller, single #[target_feature] region
    ...
}
```

**Why this matters:**
- `#[target_feature]` functions inline to share the feature-enabled region
- Generic bounds break this chain — each function is a separate compilation unit
- Even `#[inline(always)]` can't force inlining across these generic boundaries

**Downcasting is free:** Pass a higher token to a function expecting a lower one. Nested `#[arcane]` with downcasting preserves the inlining chain:

```rust
#[arcane]
fn v4_kernel(token: X64V4Token, data: &mut [f32]) {
    // Can call V3 functions — V4 is a superset
    let partial = v3_helper(token, &data[..8]);  // Downcasts, still inlines
    // ... AVX-512 specific work ...
}

#[arcane]
fn v3_helper(token: X64V3Token, chunk: &[f32]) -> f32 {
    // AVX2+FMA work — inlines into v4_kernel
    ...
}
```

**Upcasting via `IntoConcreteToken`:** Safe, but creates an LLVM optimization boundary:

```rust
fn process<T: IntoConcreteToken>(token: T, data: &mut [f32]) {
    // Generic caller has baseline LLVM target
    if let Some(v4) = token.as_x64v4() {
        process_v4(v4, data);  // Callee has AVX-512 target — mismatched
    } else if let Some(v3) = token.as_x64v3() {
        process_v3(v3, data);
    }
}
```

The issue: `#[target_feature]` changes LLVM's target for that function. Generic caller and feature-enabled callee have mismatched targets, so LLVM can't optimize across that boundary. Do dispatch once at entry, not deep in hot code.

**The rule:** Use concrete tokens for hot paths. Downcasting (V4→V3) is free. Upcasting via `IntoConcreteToken` is safe but creates optimization boundaries.

### When `-Ctarget-cpu=native` Is Fine

**Use it when:** Building for your own machine or known deployment target.

```bash
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release
```

**Detection compiles away:** With `-Ctarget-cpu=haswell`, `X64V3Token::compiled_with()` returns `Some(true)` and `summon()` becomes a no-op. The compiler elides the check entirely.
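
A small sketch of what that buys you, using only the documented `compiled_with()`/`summon()` API:

```rust
use archmage::{SimdToken, X64V3Token};

fn choose_path() {
    // With -Ctarget-cpu=haswell (or newer), compiled_with() reports Some(true)
    // and the summon() below is resolved at compile time: no CPUID, no branch.
    if X64V3Token::compiled_with() == Some(true) {
        // AVX2 + FMA are guaranteed by the target spec.
    }
    if let Some(_token) = X64V3Token::summon() {
        // On such a build this branch is statically always taken.
    }
}
```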

**Don't use it when:** Distributing binaries to unknown CPUs.

### How `#[cfg(target_feature)]` Actually Works

```rust
// TRUE only if compiled with -Ctarget-cpu=haswell or -Ctarget-feature=+avx2
#[cfg(target_feature = "avx2")]
fn only_with_avx2_target() { }

// ALWAYS true on x86_64 (baseline)
#[cfg(target_feature = "sse2")]
fn always_on_x86_64() { }
```

Default `x86_64-unknown-linux-gnu` only enables SSE/SSE2. Extended features require `-Ctarget-cpu` or `-Ctarget-feature`.
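
For example, either of these standard rustc flag forms makes the `avx2` cfg above evaluate to true:

```bash
RUSTFLAGS="-Ctarget-cpu=haswell" cargo build --release
RUSTFLAGS="-Ctarget-feature=+avx2,+fma" cargo build --release
```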

### The Cargo Feature Trap

**WRONG:** Gating type aliases on cargo features:

```rust
// BAD: Types don't exist unless cargo feature enabled!
#[cfg(all(target_arch = "x86_64", feature = "avx512"))]
pub use crate::simd::x86::w512::f32x16 as F32Vec;
```

This breaks runtime dispatch — the types aren't available even if the CPU supports AVX-512.

**Cargo features should control:**
- Whether to *attempt* higher tiers at runtime
- Compile-time-only paths for known targets

**Cargo features should NOT control:**
- Whether SIMD types exist
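
A sketch of the intended split — the types stay unconditional, and only the *attempt* at the higher tier is feature-gated (`sum_v4`/`sum_v3` are hypothetical `#[arcane]` helpers):

```rust
use archmage::SimdToken;

pub fn sum(data: &[f32; 16]) -> f32 {
    // The cargo feature gates only the *attempt* at the AVX-512 tier...
    #[cfg(feature = "avx512")]
    if let Some(token) = archmage::X64V4Token::summon() {
        return sum_v4(token, data);   // hypothetical #[arcane] helper
    }
    // ...the lower tiers (and the SIMD types) are always compiled.
    if let Some(token) = archmage::X64V3Token::summon() {
        return sum_v3(token, data);   // hypothetical #[arcane] helper
    }
    data.iter().sum()
}
```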

### `#[arcane]`: Cross-Arch Compilation

On the wrong architecture, `#[arcane]` generates an unreachable stub:

```rust
// On ARM: stub that compiles but can't be reached
#[cfg(not(target_arch = "x86_64"))]
fn process(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    unreachable!("X64V3Token cannot exist on this architecture")
}
```

### `#[arcane]` with Methods

Use `_self = Type` and reference `_self` in body:

```rust
impl Processor {
    #[arcane(_self = Processor)]
    fn process(&self, token: X64V3Token, data: &[f32; 8]) -> f32 {
        _self.threshold  // Use _self, not self
    }
}
```

### `incant!`: Dispatch Macro

```rust
use archmage::incant;

pub fn sum(data: &[f32]) -> f32 {
    incant!(sum(data))
}

// Requires suffixed functions:
// sum_v3(token: X64V3Token, ...)
// sum_v4(token: X64V4Token, ...)     // if feature = "avx512"
// sum_neon(token: NeonToken, ...)
// sum_wasm128(token: Wasm128Token, ...)
// sum_scalar(token: ScalarToken, ...)
```
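
For reference, a minimal sketch of two of those suffixed implementations (bodies are illustrative; the other suffixes follow the same shape):

```rust
use archmage::{arcane, ScalarToken, X64V3Token};
use magetypes::simd::f32x8;

#[arcane]
fn sum_v3(token: X64V3Token, data: &[f32]) -> f32 {
    let chunks = data.chunks_exact(8);
    let tail: f32 = chunks.remainder().iter().sum();
    let mut acc = f32x8::splat(token, 0.0);
    for chunk in chunks {
        acc = acc + f32x8::load(token, chunk.try_into().unwrap());
    }
    acc.reduce_add() + tail
}

fn sum_scalar(_token: ScalarToken, data: &[f32]) -> f32 {
    data.iter().sum()
}
```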

**Passthrough mode** (already have token):

```rust
fn inner<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 {
    incant!(with token => process(data))
}
```

### `ScalarToken`

Always-available fallback. Used for:
- `incant!()` convention (`_scalar` suffix)
- Consistent API shape in dispatch

### Fixed-Size Types with Polyfills

**Pick a concrete size. Use polyfills for portability.**

```rust
use magetypes::simd::f32x8;  // Always 8 lanes, polyfilled on ARM/WASM

#[arcane]
fn process(token: X64V3Token, data: &[f32; 8]) -> f32 {
    let v = f32x8::load(token, data);
    v.reduce_add()
}
```

On ARM, `f32x8` is emulated with two `f32x4` operations. The API is identical.

---

## CRITICAL: Token/Trait Design (DO NOT MODIFY)

### LLVM x86-64 Microarchitecture Levels

| Level | Features | Token | Trait |
|-------|----------|-------|-------|
| **v1** | SSE, SSE2 (baseline) | None | None (always available) |
| **v2** | + SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT | `X64V2Token` | `HasX64V2` |
| **v3** | + AVX, AVX2, FMA, BMI1, BMI2, F16C | `X64V3Token` / `Desktop64` | Use token directly |
| **v4** | + AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL | `X64V4Token` / `Avx512Token` | `HasX64V4` |
| **Modern** | + VPOPCNTDQ, IFMA, VBMI, VNNI, BF16, VBMI2, BITALG, VPCLMULQDQ, GFNI, VAES | `Avx512ModernToken` | Use token directly |
| **FP16** | AVX512FP16 (independent) | `Avx512Fp16Token` | Use token directly |

### AArch64 Tokens

| Token | Features | Trait |
|-------|----------|-------|
| `NeonToken` / `Arm64` | neon | `HasNeon` |
| `NeonAesToken` | + aes | `HasNeonAes` |
| `NeonSha3Token` | + sha3 | `HasNeonSha3` |
| `NeonCrcToken` | + crc | Use token directly |

**PROHIBITED:** NO SVE/SVE2 - Rust stable doesn't support SVE intrinsics yet.

### Rules

1. **NO granular x86 traits** - No `HasSse`, `HasSse2`, `HasAvx`, `HasAvx2`, `HasFma`, `HasAvx512f`, `HasAvx512bw`, etc.
2. **Use tier tokens** - `X64V2Token`, `X64V3Token`, `X64V4Token`, `Avx512ModernToken`
3. **Single trait per tier** - `HasX64V2`, `HasX64V4` only
4. **NEON includes fp16** - They always appear together on AArch64
5. **NO SVE** - `SveToken`, `Sve2Token`, `HasSve`, `HasSve2` are PROHIBITED (Rust stable lacks SVE support)
6. **NO WIDTH TRAITS** - `Has128BitSimd`, `Has256BitSimd`, `Has512BitSimd` are DEPRECATED and will be removed:
   - `Has256BitSimd` only enables AVX, **NOT AVX2 or FMA** — misleading and causes suboptimal codegen
   - Use concrete tokens (`X64V3Token`) or feature traits (`HasX64V2`, `HasX64V4`) instead

---

## CRITICAL: Documentation Examples

### Prefer `#[rite]` for internal code, `#[arcane]` only at entry points

**`#[rite]` should be the default.** It adds `#[target_feature]` + `#[inline]` — LLVM inlines it into any caller with matching features.

Use `#[arcane]` only when the function is called from non-SIMD code:
- After `summon()` in a public API
- From tests
- From non-`#[target_feature]` contexts

```rust
// Entry point (called after summon) - use #[arcane]
#[arcane]
pub fn process(token: Desktop64, data: &mut [f32]) {
    for chunk in data.chunks_exact_mut(8) {
        process_chunk(token, chunk);  // Calls #[rite]
    }
}

// Internal helper (called from #[arcane] or #[rite]) - use #[rite]
#[rite]
fn process_chunk(_: Desktop64, chunk: &mut [f32; 8]) {
    // ...
}
```

### Never use manual `#[target_feature]`

**DO NOT write examples with manual `#[target_feature]` + unsafe wrappers.** Use `#[arcane]` or `#[rite]` instead.

```rust
// WRONG - manual #[target_feature] wrapping
#[cfg(target_arch = "x86_64")]
#[inline]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn process_inner(data: &[f32]) -> f32 { ... }

#[cfg(target_arch = "x86_64")]
fn process(token: X64V3Token, data: &[f32]) -> f32 {
    unsafe { process_inner(data) }
}

// CORRECT - use #[arcane] (generates #[target_feature] + stubs on other arches)
#[arcane]
fn process(token: X64V3Token, data: &[f32]) -> f32 {
    // This function body is compiled with #[target_feature(enable = "avx2,fma")]
    // Intrinsics and operators inline properly into single SIMD instructions
    ...
}
```

### Use `safe_unaligned_simd` inside `#[arcane]` functions

**Use `safe_unaligned_simd` directly inside `#[arcane]` functions.** The calls are safe because the target features match.

```rust
// WRONG - raw pointers need unsafe
let v = unsafe { _mm256_loadu_ps(data.as_ptr()) };

// CORRECT - use safe_unaligned_simd (safe inside #[arcane])
let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);
```

## Quick Start

```bash
cargo test                    # Run tests
cargo test --all-features     # Test with all integrations
cargo clippy --all-features   # Lint
just generate                 # Regenerate all generated code
just validate-registry        # Validate token-registry.toml
just validate-tokens          # Validate magetypes safety + summon() checks
just parity                   # Check API parity across x86/ARM/WASM
just soundness                # Static intrinsic soundness verification
just miri                     # Run magetypes under Miri (detects UB)
just audit                    # Scan for safety-critical code
just intrinsics-refresh       # Re-extract intrinsics from current Rust
just ci                       # Run ALL checks (must pass before push/publish)
```

## CI and Publishing Rules

**ABSOLUTE REQUIREMENT: Run `just ci` (or `just all` or `cargo xtask all`) before ANY push or publish.**

```bash
just ci    # or: just all, cargo xtask ci, cargo xtask all
```

**NEVER run `git push` or `cargo publish` until this passes. No exceptions.**

CI checks (all must pass):
1. `cargo xtask generate` — regenerate all code
2. **Clean worktree check** — no uncommitted changes after generation (HARD FAIL)
3. `cargo xtask validate` — intrinsic safety + summon() feature verification
4. `cargo xtask parity` — parity check (0 issues remaining)
5. `cargo clippy --features "std macros avx512"` — zero warnings
6. `cargo test --features "std macros avx512"` — all tests pass
7. `cargo fmt --check` — code is formatted

**Note:** Parity check reports 0 issues. All W128 types have identical APIs across x86/ARM/WASM.

If ANY check fails:
- Do NOT push
- Do NOT publish
- Fix the issue first
- Re-run `just ci` until it passes

**Git tags are MANDATORY for every publish.** After `cargo publish`, immediately create tags:

```bash
git tag v{version}                        # archmage
git tag archmage-macros-v{version}        # archmage-macros
git tag magetypes-v{version}              # magetypes
git push origin v{version} archmage-macros-v{version} magetypes-v{version}
```

Publish order (respect dependency chain): `archmage-macros` → `archmage` → `magetypes`.

## Source of Truth: token-registry.toml

All token definitions, feature sets, trait mappings, and width configurations
live in `token-registry.toml`. Everything else is derived:

- `src/tokens/generated/` — token structs, traits, stubs, generated by xtask
- `archmage-macros/src/generated/` — macro registry, generated by xtask
- `magetypes/src/simd/generated/` — SIMD types, generated by xtask
- `docs/generated/` — intrinsics reference docs, generated by xtask
- `xtask/src/main.rs` validation — reads registry at runtime
- `cargo xtask validate` — verifies summon() checks match registry
- `cargo xtask parity` — checks API parity across architectures

To add/modify tokens: edit `token-registry.toml`, then `just generate`.
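
A typical edit-regenerate loop, using only the commands listed in Quick Start:

```bash
$EDITOR token-registry.toml   # add or modify a token entry
just generate                 # regenerate tokens, macros, magetypes, docs
just validate-registry        # sanity-check the registry
just ci                       # full check before pushing
```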

## Core Insight: Rust 1.85+ Changed Everything

As of Rust 1.85, **value-based intrinsics are safe inside `#[target_feature]` functions**:

```rust
#[target_feature(enable = "avx2")]
unsafe fn example() {
    let a = _mm256_setzero_ps();           // SAFE!
    let b = _mm256_add_ps(a, a);           // SAFE!
    let c = _mm256_fmadd_ps(a, a, a);      // SAFE!

    // Only memory ops remain unsafe (raw pointers)
    let v = unsafe { _mm256_loadu_ps(ptr) };  // Still needs unsafe
}
```

This means we **don't need to wrap** arithmetic, shuffle, compare, bitwise, or other value-based intrinsics. Only three pieces are needed:
1. **Tokens** - Prove CPU features are available
2. **`#[arcane]` macro** - Enable `#[target_feature]` via token proof
3. **`safe_unaligned_simd`** - Reference-based memory operations (user adds as dependency)

## How `#[arcane]` Works

The macro generates an inner function with `#[target_feature]`:

```rust
// You write:
#[arcane]
fn my_kernel(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    let v = _mm256_setzero_ps();  // Safe!
    // ...
}

// Macro generates:
fn my_kernel(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    #[target_feature(enable = "avx2,fma")]
    fn inner(data: &[f32; 8]) -> [f32; 8] {
        let v = _mm256_setzero_ps();  // Safe inside #[target_feature]!
        // ...
    }
    // SAFETY: Calling #[target_feature] fn from non-matching context.
    // Token proves CPU support was verified via summon().
    unsafe { inner(data) }
}
```

## Friendly Aliases

| Alias | Token | What it means |
|-------|-------|---------------|
| `Desktop64` | `X64V3Token` | AVX2 + FMA (Haswell 2013+, Zen 1+) |
| `Server64` | `X64V4Token` | + AVX-512 (Xeon 2017+, Zen 4+) |
| `Arm64` | `NeonToken` | NEON + FP16 (all 64-bit ARM) |

```rust
use archmage::{Desktop64, SimdToken, arcane};

#[arcane]
fn process(token: Desktop64, data: &mut [f32; 8]) {
    // AVX2 + FMA intrinsics safe here
}

if let Some(token) = Desktop64::summon() {
    process(token, &mut data);
}
```

## Directory Structure

```
token-registry.toml          # THE source of truth for all token/trait/feature data
spec.md                      # Architecture spec and safety model documentation
archmage/                    # Main crate: tokens, macros, detect
├── src/
│   ├── lib.rs              # Main exports
│   ├── tokens/             # SIMD capability tokens
│   │   ├── mod.rs          # SimdToken trait definition only
│   │   └── generated/      # Generated from token-registry.toml
│   │       ├── mod.rs      # cfg-gated module routing + re-exports
│   │       ├── traits.rs   # Marker traits (Has128BitSimd, HasX64V2, etc.)
│   │       ├── x86.rs      # x86 tokens (v2, v3) + detection
│   │       ├── x86_avx512.rs  # AVX-512 tokens (v4, modern, fp16)
│   │       ├── arm.rs      # ARM tokens + detection
│   │       ├── wasm.rs     # WASM tokens + detection
│   │       ├── x86_stubs.rs   # x86 stubs (summon → None)
│   │       ├── arm_stubs.rs   # ARM stubs
│   │       └── wasm_stubs.rs  # WASM stubs
archmage-macros/             # Proc-macro crate (#[arcane], #[rite], #[magetypes], incant!)
└── src/
    ├── lib.rs              # Macro implementation
    └── generated/          # Generated from token-registry.toml
        ├── mod.rs          # Re-exports
        └── registry.rs     # Token→features mappings
magetypes/                   # SIMD types crate (depends on archmage)
├── src/
│   ├── lib.rs              # Exports simd module
│   └── simd/
│       ├── mod.rs          # Re-exports from generated/
│       └── generated/      # Auto-generated SIMD types
│           ├── x86/        # x86-64 types (w128, w256, w512)
│           ├── arm/        # AArch64 types (w128)
│           ├── wasm/       # WASM types (w128)
│           └── polyfill.rs # Width emulation
docs/
└── generated/              # Auto-generated reference docs
    ├── x86-intrinsics-by-token.md
    ├── aarch64-intrinsics-by-token.md
    └── memory-ops-reference.md
xtask/                       # Code generator and validation
└── src/
    ├── main.rs             # Generates everything, validates safety, parity check
    ├── registry.rs         # token-registry.toml parser
    └── token_gen.rs        # Token/trait code generator
```

## CRITICAL: Codegen Style Rules — NO `writeln!` CHAINS

**THIS IS MANDATORY. ALL codegen MUST use `formatdoc!` from the `indoc` crate.**

`writeln!` chains are the single biggest readability problem in our codegen. They turn 10 lines of readable Rust into 40 lines of string-escaping noise where you can't see the generated code's structure. Every `{{` and `}}` is a bug waiting to happen. Every `.unwrap()` is visual clutter. Stop it.

### The rule

1. **Use `formatdoc!` with raw strings** for any block of generated code (2+ lines)
2. **Use `format!` or string literals** for single-line fragments only
3. **`writeln!` is BANNED** except for trivial single-line output to stdout/stderr (like progress messages)
4. **`indoc` is already a dependency** of xtask — there is zero excuse

### Pattern: `formatdoc!` with `push_str`

```rust
use indoc::formatdoc;

// WRONG — unreadable writeln! soup (actual current state of token_gen.rs)
writeln!(code, "impl SimdToken for {name} {{").unwrap();
writeln!(code, "    const NAME: &'static str = \"{display_name}\";").unwrap();
writeln!(code, "").unwrap();
writeln!(code, "    fn compiled_with() -> Option<bool> {{").unwrap();
writeln!(code, "        #[cfg(all({cfg_all}))]").unwrap();
writeln!(code, "        {{ Some(true) }}").unwrap();
writeln!(code, "        #[cfg(not(all({cfg_all})))]").unwrap();
writeln!(code, "        {{ None }}").unwrap();
writeln!(code, "    }}").unwrap();
writeln!(code, "}}").unwrap();

// CORRECT — you can actually READ the generated code
code.push_str(&formatdoc! {r#"
    impl SimdToken for {name} {{
        const NAME: &'static str = "{display_name}";

        fn compiled_with() -> Option<bool> {{
            #[cfg(all({cfg_all}))]
            {{ Some(true) }}
            #[cfg(not(all({cfg_all})))]
            {{ None }}
        }}
    }}
"#});
```

### Pattern: conditional blocks

```rust
// Build a section conditionally, then splice it in
let cascade_code = if !descendants.is_empty() {
    let mut lines = String::new();
    for desc in &descendants {
        lines.push_str(&formatdoc! {r#"
            super::{module}::{cache}.store(v, Ordering::Relaxed);
            super::{module}::{disabled}.store(disabled, Ordering::Relaxed);
        "#, module = desc.module, cache = desc.cache_var, disabled = desc.disabled_var});
    }
    lines
} else {
    String::new()
};

code.push_str(&formatdoc! {r#"
    pub fn disable(disabled: bool) {{
        CACHE.store(if disabled {{ 1 }} else {{ 0 }}, Ordering::Relaxed);
        {cascade_code}
    }}
"#});
```

### Pattern: helper functions that return String

```rust
fn gen_compiled_with(cfg_all: &str) -> String {
    formatdoc! {r#"
        fn compiled_with() -> Option<bool> {{
            #[cfg(all({cfg_all}))]
            {{ Some(true) }}
            #[cfg(not(all({cfg_all})))]
            {{ None }}
        }}
    "#}
}
```

### For magetypes method generation

Use the helpers in `xtask/src/simd_types/types.rs`:

```rust
use super::types::{gen_unary_method, gen_binary_method, gen_scalar_method};

code.push_str(&gen_unary_method("Compute absolute value", "abs", "Self(_mm256_abs_epi32(self.0))"));
code.push_str(&gen_binary_method("Add two vectors", "add", "Self(_mm256_add_epi32(self.0, other.0))"));
code.push_str(&gen_scalar_method("Extract first element", "first", "i32", "_mm_cvtsi128_si32(self.0)"));
```

### Enforcement

When touching ANY codegen file, convert `writeln!` chains to `formatdoc!` in the same commit. Don't add new `writeln!` chains. Existing `writeln!` chains in `token_gen.rs` (297 occurrences!) and `main.rs` (33 occurrences) are tech debt — convert them progressively.

## Token Hierarchy

**x86:**
- `X64V2Token` - SSE4.2 + POPCNT (Nehalem 2008+)
- `X64V3Token` / `Desktop64` - AVX2 + FMA + BMI2 (Haswell 2013+, Zen 1+)
- `X64V4Token` / `Avx512Token` - + AVX-512 F/BW/CD/DQ/VL (Skylake-X 2017+, Zen 4+)
- `Avx512ModernToken` - + modern extensions (Ice Lake 2019+, Zen 4+)
- `Avx512Fp16Token` - + FP16 (Sapphire Rapids 2023+)

**ARM:**
- `NeonToken` / `Arm64` - NEON (virtually all AArch64, requires runtime detection)
- `NeonAesToken` - + AES
- `NeonSha3Token` - + SHA3
- `NeonCrcToken` - + CRC

## Tier Traits

Only two tier traits exist for generic bounds:

```rust
fn requires_v2(token: impl HasX64V2) { ... }
fn requires_v4(token: impl HasX64V4) { ... }
fn requires_neon(token: impl HasNeon) { ... }
```

For v3 (AVX2+FMA), use `X64V3Token` directly - it's the recommended baseline.

## SIMD Types (magetypes crate)

Token-gated SIMD types live in the **magetypes** crate. **Use fixed-size types:**

```rust
use archmage::{X64V3Token, SimdToken, arcane};
use magetypes::simd::f32x8;

pub fn process(data: &[f32; 8]) -> f32 {
    if let Some(token) = X64V3Token::summon() {
        process_simd(token, data)
    } else {
        data.iter().sum()
    }
}

#[arcane]
fn process_simd(token: X64V3Token, data: &[f32; 8]) -> f32 {
    let a = f32x8::load(token, data);
    let b = f32x8::splat(token, 2.0);
    let c = a * b;
    c.reduce_add()
}
```

On ARM/WASM, `f32x8` is polyfilled with two `f32x4` operations. Pick the size that fits your algorithm.

## Safe Memory Operations

Use `safe_unaligned_simd` directly inside `#[arcane]` functions:

```rust
use archmage::{Desktop64, SimdToken, arcane};

#[arcane]
fn process(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // safe_unaligned_simd calls are SAFE inside #[arcane]
    let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);
    let squared = _mm256_mul_ps(v, v);
    let mut out = [0.0f32; 8];
    safe_unaligned_simd::x86_64::_mm256_storeu_ps(&mut out, squared);
    out
}
```

## Pending Work

### API Parity Status (0 issues — complete!)

**Current state:** All W128 types have identical APIs across x86/ARM/WASM. Reduced from 270 → 0 parity issues (100%).

Run `cargo xtask parity` to verify.

### Known Cross-Architecture Behavioral Differences

These are documented semantic differences between architectures. Tests must account for them; they are not bugs to fix.

| Issue | x86 | ARM | WASM | Workaround |
|-------|-----|-----|------|------------|
| Bitwise operators (`&`, `\|`, `^`) on integers | Trait impls (operators work) | Methods only | Methods only | Use `.and()`, `.or()`, `.xor()` methods |
| `shr` for signed integers | Logical (zero-fill) | Arithmetic (sign-extend) | Arithmetic (sign-extend) | Use `shr_arithmetic` for portable sign-extending shift |
| `blend` signature | `(mask, true, false)` | `(mask, true, false)` | `(self, other, mask)` | Avoid in portable code; use bitcast + comparison verification |
| `interleave_lo/hi` | f32x4 only | f32x4 only | f32x4 only | Only use on f32x4, not integer types |

### Long-Term

- **Generator test fixtures**: Add example input/expected output pairs to each xtask generator (SIMD types, width dispatch, tokens, macro registry). These serve as both documentation of expected output and cross-platform regression tests — run on x86, ARM, and WASM to catch codegen divergence.

- ~~**Target-feature boundary overhead benchmark**~~: Done. See `benches/asm_inspection.rs` and `docs/PERFORMANCE.md`. Key results:
  - Simple vector add (1000 x 8-float): `#[rite]` in `#[arcane]` 547 ns, `#[arcane]` per iteration 2209 ns (4x), bare `#[target_feature]` 2222 ns (4x, identical)
  - DCT-8 (100 rows x 8 dot products): `#[rite]` in `#[arcane]` 61 ns, `#[arcane]` per row 376 ns (6.2x), bare `#[target_feature]` 374 ns (6.2x, identical)
  - Cross-token nesting: downgrade (V4→V3, V3→V2) is free, upgrade (V2→V3, V3→V4) costs 4x, all patterns match bare `#[target_feature]`

  Key insight: the overhead is from the `#[target_feature]` optimization boundary, NOT from wrappers or archmage abstractions. The cost scales with computational density (4x simple add, 6.2x DCT-8). Feature direction matters: downgrades are free (superset enables inlining), upgrades hit the boundary.

- ~~**summon() caching**~~: **Implemented!** See `benches/summon_overhead.rs`. Results after adding atomic caching:
  - `Desktop64::summon()` (cached): ~1.3 ns (was 2.6 ns — **2x faster**)
  - `Avx512ModernToken::summon()` (cached): ~1.3 ns (was 7.2 ns — **6x faster**)
  - With `-Ctarget-cpu=haswell`: 0 ns (compiles away entirely)

  Implementation: Each token has a static `AtomicU8` cache (0=unknown, 1=unavailable, 2=available). Compile-time `#[cfg(target_feature)]` guard skips the cache entirely when features are guaranteed.
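
  A sketch of that pattern for the v3 token (names are illustrative; the real code is generated from `token-registry.toml`):

  ```rust
  use std::sync::atomic::{AtomicU8, Ordering};

  // 0 = unknown, 1 = unavailable, 2 = available (the scheme described above).
  static V3_CACHE: AtomicU8 = AtomicU8::new(0);

  #[cfg(target_arch = "x86_64")]
  fn v3_available() -> bool {
      // If the target spec already guarantees the features, skip the cache entirely.
      #[cfg(all(target_feature = "avx2", target_feature = "fma"))]
      {
          true
      }
      #[cfg(not(all(target_feature = "avx2", target_feature = "fma")))]
      {
          match V3_CACHE.load(Ordering::Relaxed) {
              2 => true,
              1 => false,
              _ => {
                  let ok = is_x86_feature_detected!("avx2")
                      && is_x86_feature_detected!("fma");
                  V3_CACHE.store(if ok { 2 } else { 1 }, Ordering::Relaxed);
                  ok
              }
          }
      }
  }
  ```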

### safe_unaligned_simd Gaps (discovered in rav1d-safe refactoring)

Found during pal.rs refactoring to use `#[arcane]` + `safe_unaligned_simd`:

- **SOLVED: Created `partial_simd` module in rav1d-safe** with `Is64BitsUnaligned` trait:
  ```rust
  // Safe functions with #[target_feature] - callable from #[arcane] without unsafe!
  #[target_feature(enable = "sse2")]
  pub fn mm_loadl_epi64<T: Is64BitsUnaligned>(src: &T) -> __m128i {
      unsafe { _mm_loadl_epi64(ptr::from_ref(src).cast()) }
  }
  // Trait: [u8; 8], [i16; 4], [i32; 2], u64, i64, f64, etc.
  ```
  - Generates identical `vmovq` instructions (zero overhead)
  - Pattern ready for upstream to safe_unaligned_simd

- **Verified: No overhead from slice-to-array conversion**
  - `slice[..32].try_into().unwrap()` optimizes away completely
  - `safe_simd::_mm256_loadu_si256(arr)` → same `vmovdqu` as raw pointer

### Completed

- ~~**Type implementation verification**~~: Done. Added `implementation_name() -> &'static str` to all magetypes vectors. Uses tier-based naming: `"x86::v3::f32x8"`, `"x86::v4::f32x16"`, `"arm::neon::f32x4"`, `"wasm::wasm128::f32x4"`, `"polyfill::v3::f32x8"`, `"polyfill::v3_512::f32x16"`, `"polyfill::neon::f32x8"`. Test in `tests/exhaustive_intrinsics.rs`.
- ~~**WASM u64x2 ordering comparisons**~~: Done. Added simd_lt/le/gt/ge via bias-to-signed polyfill (XOR with i64::MIN, then i64x2_lt/gt). Parity: 4 → 0.
- ~~**x86 byte shift polyfills**~~: Done. Added i8x16/u8x16 shl, shr, shr_arithmetic for all x86 widths. Uses 16-bit shift + byte mask (~2 instructions). AVX-512 shr_arithmetic uses mask registers. Parity: 9 → 4.
- ~~**All actionable parity issues**~~: Done. Closed 28 remaining issues: extend/pack ops (17), RGBA pixel ops (4), i64/u64 polyfill math (7). Parity: 37 → 9 (0 actionable).
- ~~**ARM/WASM block ops**~~: Done. ARM uses native vzip1q/vzip2q, WASM uses i32x4_shuffle. Both gained interleave_lo/hi, interleave, deinterleave_4ch, interleave_4ch, transpose_4x4, transpose_4x4_copy. Parity: 47 → 37.
- ~~**WASM cbrt + f64x2 log10_lowp**~~: Done. WASM f32x4 gained cbrt_midp/cbrt_midp_precise (scalar initial guess + Newton-Raphson). WASM f64x2 gained log10_lowp via scalar fallback.
- ~~**ARM transcendentals + x86 missing variants**~~: Done. ARM f32x4 has full lowp+midp transcendentals (log2, exp2, ln, exp, log10, pow, cbrt) with all variant coverage. ARM f64x2 has lowp transcendentals via scalar fallback. x86 gained lowp _unchecked aliases, midp _precise variants, and log10_midp family. Parity: 80 → 47.
- ~~**API surface parity detection tool**~~: Done. Use `cargo xtask parity` to detect API variances between x86/ARM/WASM.
- ~~**Move generated files to subfolder**~~: Done. All generated code now lives in `generated/` subfolders.
- ~~**Merge WASM transcendentals from `feat/wasm128`**~~: Done (354dc2b). All `_unchecked` and `_precise` variants now generated.
- ~~**ARM comparison ops**~~: Done. Added simd_eq, simd_ne, simd_lt, simd_le, simd_gt, simd_ge, blend.
- ~~**ARM bitwise ops**~~: Done. Added not, shl, shr for all integer types.
- ~~**ARM boolean reductions**~~: Done. Added all_true, any_true, bitmask for all integer types.
- ~~**x86 boolean reductions**~~: Done. Added all_true, any_true, bitmask for all integer types (128/256/512-bit).
- ~~**WASM token-gated casting methods**~~: Done. Added cast_slice, cast_slice_mut, as_bytes, as_bytes_mut, from_bytes, from_bytes_owned (token-gated replacements for bytemuck, NOT actual Pod/Zeroable implementations).
- ~~**ARM reduce_add for unsigned**~~: Done. Extended reduce_add to all integer types including unsigned.
- ~~**Approximations (rcp, rsqrt) for ARM/WASM**~~: Done. ARM uses native vrecpe/vrsqrte, WASM uses division.
- ~~**mul_sub for ARM/WASM**~~: Done. ARM uses vfma with negation, WASM uses mul+sub.
- ~~**Type conversions for ARM/WASM**~~: Done. Added to_i32x4, to_i32x4_round, from_i32x4, to_f32x4, to_i32x4_low.
- ~~**shr_arithmetic for ARM/WASM**~~: Done. Added for i8x16, i16x8, i32x4.

## Suboptimal Intrinsics (needs faster-path overloads)

Track places where we use polyfills or slower instruction sequences because the base token lacks a native intrinsic, but a higher token would have one. Each entry should get a method overload that accepts the higher token for the fast path.

| Method | Token (slow) | Polyfill | Token (fast) | Native Intrinsic | Status |
|--------|-------------|----------|-------------|------------------|--------|
| f32 cbrt initial guess | all tokens | scalar extract + bit hack | — | No SIMD cbrt exists; consider SIMD bit hack via integer ops | Low priority |

**Rules for this section:**
- Only add entries when you've verified the faster intrinsic exists and is correct.
- The overload should take the higher token as a parameter (e.g., `fn min_fast(self, other: Self, _: X64V4Token) -> Self`).
- Or use trait bounds: `fn min<T: HasX64V4>(self, other: Self, _: T) -> Self` for the fast path.
- Remove entries when the fast-path overload is implemented.

### Completed fast-path overloads

All i64/u64 min/max/abs now have `_fast` variants that take `X64V4Token`:
- `i64x2::min_fast`, `max_fast`, `abs_fast`
- `u64x2::min_fast`, `max_fast`
- `i64x4::min_fast`, `max_fast`, `abs_fast`
- `u64x4::min_fast`, `max_fast`

## License

MIT OR Apache-2.0