# archmage

> Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.

## CRITICAL: Token/Trait Design (DO NOT MODIFY)

### LLVM x86-64 Microarchitecture Levels

| Level | Features | Token | Trait |
|-------|----------|-------|-------|
| **v1** | SSE, SSE2 (baseline) | None | None (always available) |
| **v2** | + SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT | `X64V2Token` | `HasX64V2` |
| **v3** | + AVX, AVX2, FMA, BMI1, BMI2, F16C | `X64V3Token` / `Desktop64` | Use token directly |
| **v4** | + AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL | `X64V4Token` / `Avx512Token` | `HasX64V4` |
| **Modern** | + VPOPCNTDQ, IFMA, VBMI, VNNI, BF16, VBMI2, BITALG, VPCLMULQDQ, GFNI, VAES | `Avx512ModernToken` | Use token directly |
| **FP16** | AVX512FP16 (independent) | `Avx512Fp16Token` | Use token directly |

### AArch64 Tokens

| Token | Features | Trait |
|-------|----------|-------|
| `NeonToken` / `Arm64` | neon (always available) | `HasNeon` (baseline) |
| `NeonAesToken` | + aes | `HasNeonAes` |
| `NeonSha3Token` | + sha3 | `HasNeonSha3` |
| `NeonCrcToken` | + crc | Use token directly |

**PROHIBITED:** NO SVE/SVE2 - Rust stable doesn't support SVE intrinsics yet.

### Rules

1. **NO granular x86 traits** - No `HasSse`, `HasSse2`, `HasAvx`, `HasAvx2`, `HasFma`, `HasAvx512f`, `HasAvx512bw`, etc.
2. **Use tier tokens** - `X64V2Token`, `X64V3Token`, `X64V4Token`, `Avx512ModernToken`
3. **Single trait per tier** - `HasX64V2`, `HasX64V4` only
4. **NEON includes fp16** - They always appear together on AArch64
5. **NO SVE** - `SveToken`, `Sve2Token`, `HasSve`, `HasSve2` are PROHIBITED (Rust stable lacks SVE support)
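
A quick illustration of rules 1-3 (the kernel names are placeholders; `HasX64V2` and `X64V3Token` are assumed to be re-exported from the crate root):

```rust
use archmage::{HasX64V2, X64V3Token};

// WRONG - granular feature traits like `HasAvx2` or `HasFma` do not exist here.
// fn kernel(token: impl HasAvx2 + HasFma, data: &mut [f32]) { ... }

// CORRECT - take a tier token directly...
fn kernel_v3(_token: X64V3Token, data: &mut [f32]) {
    for x in data.iter_mut() { *x *= 2.0; }
}

// ...or a tier trait bound when any v2-or-higher token should be accepted.
fn kernel_v2(_token: impl HasX64V2, data: &mut [f32]) {
    for x in data.iter_mut() { *x += 1.0; }
}
```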

---

## CRITICAL: Documentation Examples

### Always prefer `#[arcane]` over manual `#[target_feature]`

**DO NOT write examples with manual `#[target_feature]` + unsafe wrappers.** The `#[arcane]` macro does this automatically and is the correct pattern for archmage.

```rust
// WRONG - manual #[target_feature] wrapping
#[cfg(target_arch = "x86_64")]
#[inline]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn process_inner(data: &[f32]) -> f32 { ... }

#[cfg(target_arch = "x86_64")]
fn process(token: X64V3Token, data: &[f32]) -> f32 {
    unsafe { process_inner(data) }
}

// CORRECT - use #[arcane] (it generates the above automatically)
#[cfg(target_arch = "x86_64")]
#[arcane]
fn process(token: X64V3Token, data: &[f32]) -> f32 {
    // This function body is compiled with #[target_feature(enable = "avx2,fma")]
    // Intrinsics and operators inline properly into single SIMD instructions
    ...
}
```

### Use `safe_unaligned_simd` inside `#[arcane]` functions

**Memory operations should go through `safe_unaligned_simd` rather than raw-pointer intrinsics.** Inside `#[arcane]` those calls are safe because the enabled target features match.

```rust
// WRONG - raw pointers need unsafe
let v = unsafe { _mm256_loadu_ps(data.as_ptr()) };

// CORRECT - use safe_unaligned_simd (safe inside #[arcane])
let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);
```

## Quick Start

```bash
cargo test                    # Run tests
cargo test --all-features     # Test with all integrations
cargo clippy --all-features   # Lint
just generate                 # Regenerate all generated code
just validate-registry        # Validate token-registry.toml
just validate-tokens          # Validate magetypes safety + try_new() checks
just parity                   # Check API parity across x86/ARM/WASM
just soundness                # Static intrinsic soundness verification
just miri                     # Run magetypes under Miri (detects UB)
just audit                    # Scan for safety-critical code
just intrinsics-refresh       # Re-extract intrinsics from current Rust
just ci                       # Run ALL checks (must pass before push/publish)
```

## CI and Publishing Rules

**ABSOLUTE REQUIREMENT: Run `just ci` (or `just all` or `cargo xtask all`) before ANY push or publish.**

```bash
just ci    # or: just all, cargo xtask ci, cargo xtask all
```

**NEVER run `git push` or `cargo publish` until this passes. No exceptions.**

CI checks (all must pass):
1. `cargo xtask generate` — regenerate all code
2. **Clean worktree check** — no uncommitted changes after generation (HARD FAIL)
3. `cargo xtask validate` — intrinsic safety + try_new() feature verification
4. `cargo xtask parity` — parity check (0 issues remaining)
5. `cargo clippy --features "std macros bytemuck avx512"` — zero warnings
6. `cargo test --features "std macros bytemuck avx512"` — all tests pass
7. `cargo fmt --check` — code is formatted

**Note:** Parity check reports 0 issues. All W128 types have identical APIs across x86/ARM/WASM.

If ANY check fails:
- Do NOT push
- Do NOT publish
- Fix the issue first
- Re-run `just ci` until it passes

## Source of Truth: token-registry.toml

All token definitions, feature sets, trait mappings, and width configurations
live in `token-registry.toml`. Everything else is derived:

- `src/tokens/generated/` — token structs, traits, stubs, generated by xtask
- `archmage-macros/src/generated/` — macro registry, generated by xtask
- `magetypes/src/simd/generated/` — SIMD types, generated by xtask
- `docs/generated/` — intrinsics reference docs, generated by xtask
- `xtask/src/main.rs` validation — reads registry at runtime
- `cargo xtask validate` — verifies try_new() checks match registry
- `cargo xtask parity` — checks API parity across architectures

To add/modify tokens: edit `token-registry.toml`, then `just generate`.

## Core Insight: Rust 1.85+ Changed Everything

As of Rust 1.85, **value-based intrinsics are safe inside `#[target_feature]` functions**:

```rust
#[target_feature(enable = "avx2")]
unsafe fn example() {
    let a = _mm256_setzero_ps();           // SAFE!
    let b = _mm256_add_ps(a, a);           // SAFE!
    let c = _mm256_fmadd_ps(a, a, a);      // SAFE!

    // Only memory ops remain unsafe (raw pointers)
    let v = unsafe { _mm256_loadu_ps(ptr) };  // Still needs unsafe
}
```

This means we **don't need to wrap** arithmetic, shuffle, compare, bitwise, or other value-based intrinsics. Only three pieces are needed:
1. **Tokens** - Prove CPU features are available
2. **`#[arcane]` macro** - Enable `#[target_feature]` via token proof
3. **`safe_unaligned_simd`** - Reference-based memory operations (user adds as dependency)

## How `#[arcane]` Works

The macro generates an inner function with `#[target_feature]`:

```rust
// You write:
#[arcane]
fn my_kernel(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    let v = _mm256_setzero_ps();  // Safe!
    // ...
}

// Macro generates:
fn my_kernel(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    #[target_feature(enable = "avx2,fma")]
    unsafe fn inner(data: &[f32; 8]) -> [f32; 8] {
        let v = _mm256_setzero_ps();  // Safe inside #[target_feature]!
        // ...
    }
    // SAFETY: Token proves CPU support was verified via try_new()
    unsafe { inner(data) }
}
```

## Friendly Aliases

| Alias | Token | What it means |
|-------|-------|---------------|
| `Desktop64` | `X64V3Token` | AVX2 + FMA (Haswell 2013+, Zen 1+) |
| `Server64` | `X64V4Token` | + AVX-512 (Xeon 2017+, Zen 4+) |
| `Arm64` | `NeonToken` | NEON + FP16 (all 64-bit ARM) |

```rust
use archmage::{Desktop64, SimdToken, arcane};

#[arcane]
fn process(token: Desktop64, data: &mut [f32; 8]) {
    // AVX2 + FMA intrinsics safe here
}

if let Some(token) = Desktop64::summon() {
    process(token, &mut data);
}
```

## Directory Structure

```
token-registry.toml          # THE source of truth for all token/trait/feature data
spec.md                      # Architecture spec and safety model documentation
archmage/                    # Main crate: tokens, macros, detect
├── src/
│   ├── lib.rs              # Main exports
│   ├── tokens/             # SIMD capability tokens
│   │   ├── mod.rs          # SimdToken trait definition only
│   │   └── generated/      # Generated from token-registry.toml
│   │       ├── mod.rs      # cfg-gated module routing + re-exports
│   │       ├── traits.rs   # Marker traits (Has128BitSimd, HasX64V2, etc.)
│   │       ├── x86.rs      # x86 tokens (v2, v3) + detection
│   │       ├── x86_avx512.rs  # AVX-512 tokens (v4, modern, fp16)
│   │       ├── arm.rs      # ARM tokens + detection
│   │       ├── wasm.rs     # WASM tokens + detection
│   │       ├── x86_stubs.rs   # x86 stubs (try_new → None)
│   │       ├── arm_stubs.rs   # ARM stubs
│   │       └── wasm_stubs.rs  # WASM stubs
archmage-macros/             # Proc-macro crate (#[arcane], #[multiwidth])
└── src/
    ├── lib.rs              # Macro implementation
    └── generated/          # Generated from token-registry.toml
        ├── mod.rs          # Re-exports
        └── registry.rs     # Token→features mappings
magetypes/                   # SIMD types crate (depends on archmage)
├── src/
│   ├── lib.rs              # Exports simd module
│   └── simd/
│       ├── mod.rs          # Re-exports from generated/
│       └── generated/      # Auto-generated SIMD types
│           ├── x86/        # x86-64 types (w128, w256, w512)
│           ├── arm/        # AArch64 types (w128)
│           ├── wasm/       # WASM types (w128)
│           └── polyfill.rs # Width emulation
docs/
└── generated/              # Auto-generated reference docs
    ├── x86-intrinsics-by-token.md
    ├── aarch64-intrinsics-by-token.md
    └── memory-ops-reference.md
xtask/                       # Code generator and validation
└── src/
    ├── main.rs             # Generates everything, validates safety, parity check
    ├── registry.rs         # token-registry.toml parser
    └── token_gen.rs        # Token/trait code generator
```

## CRITICAL: Codegen Style Rules

**NEVER use `writeln!` chains or `write!` chains for code generation.** Use `r#"..."#` raw strings with `formatdoc!` (from the `indoc` crate) instead:

```rust
// WRONG - verbose, hard to read, easy to get wrong
writeln!(code, "/// {doc}").unwrap();
writeln!(code, "pub fn {name}(self) -> Self {{").unwrap();
writeln!(code, "    Self({body})").unwrap();
writeln!(code, "}}").unwrap();

// CORRECT - use formatdoc! with raw strings
use indoc::formatdoc;
code.push_str(&formatdoc! {r#"
    /// {doc}
    pub fn {name}(self) -> Self {{
        Self({body})
    }}
"#});
```

To generate methods, use the helpers in `xtask/src/simd_types/types.rs`:

```rust
use super::types::{gen_unary_method, gen_binary_method, gen_scalar_method};

code.push_str(&gen_unary_method("Compute absolute value", "abs", "Self(_mm256_abs_epi32(self.0))"));
code.push_str(&gen_binary_method("Add two vectors", "add", "Self(_mm256_add_epi32(self.0, other.0))"));
code.push_str(&gen_scalar_method("Extract first element", "first", "i32", "_mm_cvtsi128_si32(self.0)"));
```

## Token Hierarchy

**x86:**
- `X64V2Token` - SSE4.2 + POPCNT (Nehalem 2008+)
- `X64V3Token` / `Desktop64` - AVX2 + FMA + BMI2 (Haswell 2013+, Zen 1+)
- `X64V4Token` / `Avx512Token` - + AVX-512 F/BW/CD/DQ/VL (Skylake-X 2017+, Zen 4+)
- `Avx512ModernToken` - + modern extensions (Ice Lake 2019+, Zen 4+)
- `Avx512Fp16Token` - + FP16 (Sapphire Rapids 2023+)

**ARM:**
- `NeonToken` / `Arm64` - NEON (baseline, always available)
- `NeonAesToken` - + AES
- `NeonSha3Token` - + SHA3
- `NeonCrcToken` - + CRC
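
The hierarchy is typically walked top-down at runtime: summon the highest tier you have a kernel for and fall back. A minimal dispatch sketch (the kernel bodies are placeholders, not real SIMD code):

```rust
use archmage::{arcane, SimdToken, X64V3Token, X64V4Token};

#[cfg(target_arch = "x86_64")]
#[arcane]
fn sum_v4(_token: X64V4Token, data: &[f32]) -> f32 {
    // Placeholder body; compiled with the full v4 (AVX-512) feature set.
    data.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[arcane]
fn sum_v3(_token: X64V3Token, data: &[f32]) -> f32 {
    // Placeholder body; compiled with the v3 (AVX2 + FMA) feature set.
    data.iter().sum()
}

#[cfg(target_arch = "x86_64")]
fn sum(data: &[f32]) -> f32 {
    if let Some(t) = X64V4Token::summon() {
        sum_v4(t, data)
    } else if let Some(t) = X64V3Token::summon() {
        sum_v3(t, data)
    } else {
        data.iter().sum() // scalar fallback (x86-64-v1 baseline)
    }
}
```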

## Tier Traits

Only two x86 tier traits exist for generic bounds (`HasX64V2`, `HasX64V4`), plus `HasNeon` on AArch64:

```rust
fn requires_v2(token: impl HasX64V2) { ... }
fn requires_v4(token: impl HasX64V4) { ... }
fn requires_neon(token: impl HasNeon) { ... }
```

For v3 (AVX2+FMA), use `X64V3Token` directly - it's the recommended baseline.

## SIMD Types (magetypes crate)

Token-gated SIMD types live in the **magetypes** crate:

```rust
use archmage::{X64V3Token, SimdToken};
use magetypes::simd::f32x8;

if let Some(token) = X64V3Token::summon() {
    let a = f32x8::splat(token, 1.0);
    let b = f32x8::splat(token, 2.0);
    let c = a + b;  // Natural operators!
}
```

For multiwidth code, use `magetypes::simd::*`:

```rust
use archmage::multiwidth;

#[multiwidth]
mod kernels {
    use magetypes::simd::*;

    pub fn sum(token: Token, data: &[f32]) -> f32 {
        let mut acc = f32xN::zero(token);
        // ...
    }
}
```

## Safe Memory Operations

Use `safe_unaligned_simd` directly inside `#[arcane]` functions:

```rust
use archmage::{Desktop64, SimdToken, arcane};

#[arcane]
fn process(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // safe_unaligned_simd calls are SAFE inside #[arcane]
    let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);
    let squared = _mm256_mul_ps(v, v);
    let mut out = [0.0f32; 8];
    safe_unaligned_simd::x86_64::_mm256_storeu_ps(&mut out, squared);
    out
}
```

## Pending Work

### API Parity Status (0 issues — complete!)

**Current state:** All W128 types have identical APIs across x86/ARM/WASM. Reduced from 270 → 0 parity issues (100%).

Run `cargo xtask parity` to verify.

### Known Cross-Architecture Behavioral Differences

These are documented semantic differences between architectures. Tests must account for them; they are not bugs to fix.

| Issue | x86 | ARM | WASM | Workaround |
|-------|-----|-----|------|------------|
| Bitwise operators (`&`, `\|`, `^`) on integers | Trait impls (operators work) | Methods only | Methods only | Use `.and()`, `.or()`, `.xor()` methods |
| `shr` for signed integers | Logical (zero-fill) | Arithmetic (sign-extend) | Arithmetic (sign-extend) | Use `shr_arithmetic` for portable sign-extending shift |
| `blend` signature | `(mask, true, false)` | `(mask, true, false)` | `(self, other, mask)` | Avoid in portable code; use bitcast + comparison verification |
| `interleave_lo/hi` | f32x4 only | f32x4 only | f32x4 only | Only use on f32x4, not integer types |
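
In code that compiles for all three backends, the safest shape is to stick to the method forms from the workaround column. A rough sketch, assuming `.and()` and `.shr_arithmetic(n)` take exactly these forms (details may differ):

```rust
use magetypes::simd::i32x4;

// Takes already-constructed vectors so it is token-agnostic.
fn mask_and_shift(a: i32x4, m: i32x4) -> (i32x4, i32x4) {
    let masked = a.and(m);             // not `a & m`: operator impls are x86-only
    let shifted = a.shr_arithmetic(1); // sign-extending shift on every architecture
    (masked, shifted)
}
```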

### Long-Term

- **Generator test fixtures**: Add example input/expected output pairs to each xtask generator (SIMD types, width dispatch, tokens, macro registry). These serve as both documentation of expected output and cross-platform regression tests — run on x86, ARM, and WASM to catch codegen divergence.

### Completed

- ~~**WASM u64x2 ordering comparisons**~~: Done. Added simd_lt/le/gt/ge via bias-to-signed polyfill (XOR with i64::MIN, then i64x2_lt/gt). Parity: 4 → 0.
- ~~**x86 byte shift polyfills**~~: Done. Added i8x16/u8x16 shl, shr, shr_arithmetic for all x86 widths. Uses 16-bit shift + byte mask (~2 instructions). AVX-512 shr_arithmetic uses mask registers. Parity: 9 → 4.
- ~~**All actionable parity issues**~~: Done. Closed 28 remaining issues: extend/pack ops (17), RGBA pixel ops (4), i64/u64 polyfill math (7). Parity: 37 → 9 (0 actionable).
- ~~**ARM/WASM block ops**~~: Done. ARM uses native vzip1q/vzip2q, WASM uses i32x4_shuffle. Both gained interleave_lo/hi, interleave, deinterleave_4ch, interleave_4ch, transpose_4x4, transpose_4x4_copy. Parity: 47 → 37.
- ~~**WASM cbrt + f64x2 log10_lowp**~~: Done. WASM f32x4 gained cbrt_midp/cbrt_midp_precise (scalar initial guess + Newton-Raphson). WASM f64x2 gained log10_lowp via scalar fallback.
- ~~**ARM transcendentals + x86 missing variants**~~: Done. ARM f32x4 has full lowp+midp transcendentals (log2, exp2, ln, exp, log10, pow, cbrt) with all variant coverage. ARM f64x2 has lowp transcendentals via scalar fallback. x86 gained lowp _unchecked aliases, midp _precise variants, and log10_midp family. Parity: 80 → 47.
- ~~**API surface parity detection tool**~~: Done. Use `cargo xtask parity` to detect API variances between x86/ARM/WASM.
- ~~**Move generated files to subfolder**~~: Done. All generated code now lives in `generated/` subfolders.
- ~~**Merge WASM transcendentals from `feat/wasm128`**~~: Done (354dc2b). All `_unchecked` and `_precise` variants now generated.
- ~~**ARM comparison ops**~~: Done. Added simd_eq, simd_ne, simd_lt, simd_le, simd_gt, simd_ge, blend.
- ~~**ARM bitwise ops**~~: Done. Added not, shl, shr for all integer types.
- ~~**ARM boolean reductions**~~: Done. Added all_true, any_true, bitmask for all integer types.
- ~~**x86 boolean reductions**~~: Done. Added all_true, any_true, bitmask for all integer types (128/256/512-bit).
- ~~**WASM bytemuck methods**~~: Done. Added cast_slice, cast_slice_mut, as_bytes, as_bytes_mut, from_bytes, from_bytes_owned.
- ~~**ARM reduce_add for unsigned**~~: Done. Extended reduce_add to all integer types including unsigned.
- ~~**Approximations (rcp, rsqrt) for ARM/WASM**~~: Done. ARM uses native vrecpe/vrsqrte, WASM uses division.
- ~~**mul_sub for ARM/WASM**~~: Done. ARM uses vfma with negation, WASM uses mul+sub.
- ~~**Type conversions for ARM/WASM**~~: Done. Added to_i32x4, to_i32x4_round, from_i32x4, to_f32x4, to_i32x4_low.
- ~~**shr_arithmetic for ARM/WASM**~~: Done. Added for i8x16, i16x8, i32x4.

## Suboptimal Intrinsics (needs faster-path overloads)

Track places where we use polyfills or slower instruction sequences because the base token lacks a native intrinsic, but a higher token would have one. Each entry should get a method overload that accepts the higher token for the fast path.

| Method | Token (slow) | Polyfill | Token (fast) | Native Intrinsic | Status |
|--------|-------------|----------|-------------|------------------|--------|
| f32 cbrt initial guess | all tokens | scalar extract + bit hack | (none) | No SIMD cbrt exists; consider SIMD bit hack via integer ops | Low priority |

**Rules for this section:**
- Only add entries when you've verified the faster intrinsic exists and is correct.
- The overload should take the higher token as a parameter (e.g., `fn min_fast(self, other: Self, _: X64V4Token) -> Self`).
- Or use trait bounds: `fn min<T: HasX64V4>(self, other: Self, _: T) -> Self` for the fast path.
- Remove entries when the fast-path overload is implemented.

### Completed fast-path overloads

All i64/u64 min/max/abs now have `_fast` variants that take `X64V4Token`:
- `i64x2::min_fast`, `max_fast`, `abs_fast`
- `u64x2::min_fast`, `max_fast`
- `i64x4::min_fast`, `max_fast`, `abs_fast`
- `u64x4::min_fast`, `max_fast`
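
A sketch of how a caller picks the fast path (assumes `i64x2::splat(token, v)` and a polyfilled `min` exist alongside the `_fast` variants listed above):

```rust
use archmage::{SimdToken, X64V3Token, X64V4Token};
use magetypes::simd::i64x2;

fn min_example() {
    if let Some(v4) = X64V4Token::summon() {
        let a = i64x2::splat(v4, -5);
        let b = i64x2::splat(v4, 3);
        let _fast = a.min_fast(b, v4); // native AVX-512 VPMINSQ
    } else if let Some(v3) = X64V3Token::summon() {
        let a = i64x2::splat(v3, -5);
        let b = i64x2::splat(v3, 3);
        let _slow = a.min(b); // polyfilled compare-and-blend on v2/v3
    }
}
```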

## License

MIT OR Apache-2.0