# archmage
> Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.
## CRITICAL: Token/Trait Design (DO NOT MODIFY)
### LLVM x86-64 Microarchitecture Levels
| Level | Features | Token | Trait |
|---|---|---|---|
| **v1** | SSE, SSE2 (baseline) | None | None (always available) |
| **v2** | + SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT | `X64V2Token` | `HasX64V2` |
| **v3** | + AVX, AVX2, FMA, BMI1, BMI2, F16C | `X64V3Token` / `Desktop64` | Use token directly |
| **v4** | + AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL | `X64V4Token` / `Avx512Token` | `HasX64V4` |
| **Modern** | + VPOPCNTDQ, IFMA, VBMI, VNNI, BF16, VBMI2, BITALG, VPCLMULQDQ, GFNI, VAES | `Avx512ModernToken` | Use token directly |
| **FP16** | AVX512FP16 (independent) | `Avx512Fp16Token` | Use token directly |
### AArch64 Tokens
| Token | Features | Trait |
|---|---|---|
| `NeonToken` / `Arm64` | neon (always available) | `HasNeon` (baseline) |
| `NeonAesToken` | + aes | `HasNeonAes` |
| `NeonSha3Token` | + sha3 | `HasNeonSha3` |
| `NeonCrcToken` | + crc | Use token directly |
**PROHIBITED:** NO SVE/SVE2 - Rust stable doesn't support SVE intrinsics yet.
### Rules
1. **NO granular x86 traits** - No `HasSse`, `HasSse2`, `HasAvx`, `HasAvx2`, `HasFma`, `HasAvx512f`, `HasAvx512bw`, etc.
2. **Use tier tokens** - `X64V2Token`, `X64V3Token`, `X64V4Token`, `Avx512ModernToken`
3. **Single trait per tier** - `HasX64V2`, `HasX64V4` only
4. **NEON includes fp16** - They always appear together on AArch64
5. **NO SVE** - `SveToken`, `Sve2Token`, `HasSve`, `HasSve2` are PROHIBITED (Rust stable lacks SVE support)
---
## CRITICAL: Documentation Examples
### Always prefer `#[arcane]` over manual `#[target_feature]`
**DO NOT write examples with manual `#[target_feature]` + unsafe wrappers.** The `#[arcane]` macro does this automatically and is the correct pattern for archmage.
```rust
// WRONG - manual #[target_feature] wrapping
#[cfg(target_arch = "x86_64")]
#[inline]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn process_inner(data: &[f32]) -> f32 { ... }
#[cfg(target_arch = "x86_64")]
fn process(token: X64V3Token, data: &[f32]) -> f32 {
    unsafe { process_inner(data) }
}

// CORRECT - use #[arcane] (it generates the above automatically)
#[cfg(target_arch = "x86_64")]
#[arcane]
fn process(token: X64V3Token, data: &[f32]) -> f32 {
    // This function body is compiled with #[target_feature(enable = "avx2,fma")]
    // Intrinsics and operators inline properly into single SIMD instructions
    ...
}
```
### Use `safe_unaligned_simd` inside `#[arcane]` functions
**Use `safe_unaligned_simd` directly inside `#[arcane]` functions.** The calls are safe because the target features match.
```rust
// WRONG - raw pointers need unsafe
let v = unsafe { _mm256_loadu_ps(data.as_ptr()) };
// CORRECT - use safe_unaligned_simd (safe inside #[arcane])
let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);
```
## Quick Start
```bash
cargo test # Run tests
cargo test --all-features # Test with all integrations
cargo clippy --all-features # Lint
just generate # Regenerate all generated code
just validate-registry # Validate token-registry.toml
just validate-tokens # Validate magetypes safety + try_new() checks
just parity # Check API parity across x86/ARM/WASM
just soundness # Static intrinsic soundness verification
just miri # Run magetypes under Miri (detects UB)
just audit # Scan for safety-critical code
just intrinsics-refresh # Re-extract intrinsics from current Rust
just ci # Run ALL checks (must pass before push/publish)
```
## CI and Publishing Rules
**ABSOLUTE REQUIREMENT: Run `just ci` (or `just all` or `cargo xtask all`) before ANY push or publish.**
```bash
just ci # or: just all, cargo xtask ci, cargo xtask all
```
**NEVER run `git push` or `cargo publish` until this passes. No exceptions.**
CI checks (all must pass):
1. `cargo xtask generate` — regenerate all code
2. **Clean worktree check** — no uncommitted changes after generation (HARD FAIL)
3. `cargo xtask validate` — intrinsic safety + try_new() feature verification
4. `cargo xtask parity` — parity check (0 issues remaining)
5. `cargo clippy --features "std macros bytemuck avx512"` — zero warnings
6. `cargo test --features "std macros bytemuck avx512"` — all tests pass
7. `cargo fmt --check` — code is formatted
**Note:** Parity check reports 0 issues. All W128 types have identical APIs across x86/ARM/WASM.
If ANY check fails:
- Do NOT push
- Do NOT publish
- Fix the issue first
- Re-run `just ci` until it passes
## Source of Truth: token-registry.toml
All token definitions, feature sets, trait mappings, and width configurations
live in `token-registry.toml`. Everything else is derived:
- `src/tokens/generated/` — token structs, traits, stubs, generated by xtask
- `archmage-macros/src/generated/` — macro registry, generated by xtask
- `magetypes/src/simd/generated/` — SIMD types, generated by xtask
- `docs/generated/` — intrinsics reference docs, generated by xtask
- `xtask/src/main.rs` validation — reads registry at runtime
- `cargo xtask validate` — verifies try_new() checks match registry
- `cargo xtask parity` — checks API parity across architectures
To add/modify tokens: edit `token-registry.toml`, then `just generate`.
## Core Insight: Rust 1.85+ Changed Everything
As of Rust 1.85, **value-based intrinsics are safe inside `#[target_feature]` functions**:
```rust
#[target_feature(enable = "avx2")]
unsafe fn example() {
    let a = _mm256_setzero_ps(); // SAFE!
    let b = _mm256_add_ps(a, a); // SAFE!
    let c = _mm256_fmadd_ps(a, a, a); // SAFE!
    // Only memory ops remain unsafe (raw pointers)
    let v = unsafe { _mm256_loadu_ps(ptr) }; // Still needs unsafe
}
```
This means we **don't need to wrap** arithmetic, shuffle, compare, bitwise, or other value-based intrinsics. The only pieces needed are:
1. **Tokens** - Prove CPU features are available
2. **`#[arcane]` macro** - Enable `#[target_feature]` via token proof
3. **`safe_unaligned_simd`** - Reference-based memory operations (user adds as dependency)
## How `#[arcane]` Works
The macro generates an inner function with `#[target_feature]`:
```rust
// You write:
#[arcane]
fn my_kernel(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    let v = _mm256_setzero_ps(); // Safe!
    // ...
}

// Macro generates:
fn my_kernel(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    #[target_feature(enable = "avx2,fma")]
    unsafe fn inner(data: &[f32; 8]) -> [f32; 8] {
        let v = _mm256_setzero_ps(); // Safe inside #[target_feature]!
        // ...
    }
    // SAFETY: Token proves CPU support was verified via try_new()
    unsafe { inner(data) }
}
```
## Friendly Aliases
| Alias | Token | Features |
|---|---|---|
| `Desktop64` | `X64V3Token` | AVX2 + FMA (Haswell 2013+, Zen 1+) |
| `Server64` | `X64V4Token` | + AVX-512 (Xeon 2017+, Zen 4+) |
| `Arm64` | `NeonToken` | NEON + FP16 (all 64-bit ARM) |
```rust
use archmage::{Desktop64, SimdToken, arcane};
#[arcane]
fn process(token: Desktop64, data: &mut [f32; 8]) {
    // AVX2 + FMA intrinsics safe here
}

if let Some(token) = Desktop64::summon() {
    process(token, &mut data);
}
```
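The same pattern works on AArch64 via the `Arm64` alias. Below is a minimal sketch, assuming `Arm64` is re-exported from the crate root like `Desktop64`; since NEON is the baseline on 64-bit ARM, `summon()` should always return `Some` there.

```rust
#[cfg(target_arch = "aarch64")]
mod neon_demo {
    use archmage::{Arm64, SimdToken, arcane};
    use core::arch::aarch64::{vaddq_f32, vdupq_n_f32, vgetq_lane_f32};

    #[arcane]
    fn double_first(_token: Arm64) -> f32 {
        let v = vdupq_n_f32(21.0);     // splat 21.0 into all four lanes
        let doubled = vaddq_f32(v, v); // NEON add, safe inside #[arcane]
        vgetq_lane_f32::<0>(doubled)   // extract lane 0 -> 42.0
    }

    pub fn run() -> Option<f32> {
        Arm64::summon().map(double_first)
    }
}
```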
## Directory Structure
```
token-registry.toml # THE source of truth for all token/trait/feature data
spec.md # Architecture spec and safety model documentation
archmage/ # Main crate: tokens, macros, detect
├── src/
│ ├── lib.rs # Main exports
│ ├── tokens/ # SIMD capability tokens
│ │ ├── mod.rs # SimdToken trait definition only
│ │ └── generated/ # Generated from token-registry.toml
│ │ ├── mod.rs # cfg-gated module routing + re-exports
│ │ ├── traits.rs # Marker traits (Has128BitSimd, HasX64V2, etc.)
│ │ ├── x86.rs # x86 tokens (v2, v3) + detection
│ │ ├── x86_avx512.rs # AVX-512 tokens (v4, modern, fp16)
│ │ ├── arm.rs # ARM tokens + detection
│ │ ├── wasm.rs # WASM tokens + detection
│ │ ├── x86_stubs.rs # x86 stubs (try_new → None)
│ │ ├── arm_stubs.rs # ARM stubs
│ │ └── wasm_stubs.rs # WASM stubs
archmage-macros/ # Proc-macro crate (#[arcane], #[multiwidth])
└── src/
├── lib.rs # Macro implementation
└── generated/ # Generated from token-registry.toml
├── mod.rs # Re-exports
└── registry.rs # Token→features mappings
magetypes/ # SIMD types crate (depends on archmage)
├── src/
│ ├── lib.rs # Exports simd module
│ └── simd/
│ ├── mod.rs # Re-exports from generated/
│ └── generated/ # Auto-generated SIMD types
│ ├── x86/ # x86-64 types (w128, w256, w512)
│ ├── arm/ # AArch64 types (w128)
│ ├── wasm/ # WASM types (w128)
│ └── polyfill.rs # Width emulation
docs/
└── generated/ # Auto-generated reference docs
├── x86-intrinsics-by-token.md
├── aarch64-intrinsics-by-token.md
└── memory-ops-reference.md
xtask/ # Code generator and validation
└── src/
├── main.rs # Generates everything, validates safety, parity check
├── registry.rs # token-registry.toml parser
└── token_gen.rs # Token/trait code generator
```
## CRITICAL: Codegen Style Rules
**NEVER use `writeln!` chains or `write!` chains for code generation.** Use `r#"..."#` raw strings with `formatdoc!` (from the `indoc` crate) instead:
```rust
// WRONG - verbose, hard to read, easy to get wrong
writeln!(code, "/// {doc}").unwrap();
writeln!(code, "pub fn {name}(self) -> Self {{").unwrap();
writeln!(code, " Self({body})").unwrap();
writeln!(code, "}}").unwrap();
// CORRECT - use formatdoc! with raw strings
use indoc::formatdoc;
code.push_str(&formatdoc! {r#"
    /// {doc}
    pub fn {name}(self) -> Self {{
        Self({body})
    }}
"#});
```
To generate methods, use the helpers in `xtask/src/simd_types/types.rs`:
```rust
use super::types::{gen_unary_method, gen_binary_method, gen_scalar_method};
code.push_str(&gen_unary_method("Compute absolute value", "abs", "Self(_mm256_abs_epi32(self.0))"));
code.push_str(&gen_binary_method("Add two vectors", "add", "Self(_mm256_add_epi32(self.0, other.0))"));
code.push_str(&gen_scalar_method("Extract first element", "first", "i32", "_mm_cvtsi128_si32(self.0)"));
```
## Token Hierarchy
**x86:**
- `X64V2Token` - SSE4.2 + POPCNT (Nehalem 2008+)
- `X64V3Token` / `Desktop64` - AVX2 + FMA + BMI2 (Haswell 2013+, Zen 1+)
- `X64V4Token` / `Avx512Token` - + AVX-512 F/BW/CD/DQ/VL (Skylake-X 2017+, Zen 4+)
- `Avx512ModernToken` - + modern extensions (Ice Lake 2019+, Zen 4+)
- `Avx512Fp16Token` - + FP16 (Sapphire Rapids 2023+)
**ARM:**
- `NeonToken` / `Arm64` - NEON (baseline, always available)
- `NeonAesToken` - + AES
- `NeonSha3Token` - + SHA3
- `NeonCrcToken` - + CRC
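A typical way to consume this hierarchy is to probe from the widest tier downward at runtime. Here is a minimal sketch, assuming the crate-root re-exports used elsewhere in this document; the kernel bodies are placeholders, not real SIMD code.

```rust
#[cfg(target_arch = "x86_64")]
mod dispatch {
    use archmage::{SimdToken, X64V3Token, X64V4Token, arcane};

    #[arcane]
    fn sum_v4(_token: X64V4Token, data: &[f32]) -> f32 {
        data.iter().sum() // AVX-512 intrinsics would go here
    }

    #[arcane]
    fn sum_v3(_token: X64V3Token, data: &[f32]) -> f32 {
        data.iter().sum() // AVX2 + FMA intrinsics would go here
    }

    /// Probe the widest tier first; summon() returns Some only when the CPU
    /// actually supports that tier's feature set.
    pub fn sum(data: &[f32]) -> f32 {
        if let Some(token) = X64V4Token::summon() {
            sum_v4(token, data)
        } else if let Some(token) = X64V3Token::summon() {
            sum_v3(token, data)
        } else {
            data.iter().sum() // scalar fallback
        }
    }
}
```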
## Tier Traits
Only two x86 tier traits exist for generic bounds (`HasX64V2`, `HasX64V4`), plus `HasNeon` on AArch64:
```rust
fn requires_v2(token: impl HasX64V2) { ... }
fn requires_v4(token: impl HasX64V4) { ... }
fn requires_neon(token: impl HasNeon) { ... }
```
For v3 (AVX2+FMA), use `X64V3Token` directly - it's the recommended baseline.
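At the call site, a tier-trait bound accepts any token that implements that trait. A minimal sketch, assuming the trait is importable from the crate root alongside the tokens:

```rust
use archmage::{HasX64V2, SimdToken, X64V2Token};

// Generic over the v2 tier trait: any token proving the v2 feature set works here.
fn popcount_sum(_token: impl HasX64V2, data: &[u64]) -> u32 {
    data.iter().map(|x| x.count_ones()).sum()
}

fn run(data: &[u64]) -> u32 {
    match X64V2Token::summon() {
        Some(token) => popcount_sum(token, data),
        None => data.iter().map(|x| x.count_ones()).sum(), // scalar fallback
    }
}
```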
## SIMD Types (magetypes crate)
Token-gated SIMD types live in the **magetypes** crate:
```rust
use archmage::{X64V3Token, SimdToken};
use magetypes::simd::f32x8;
if let Some(token) = X64V3Token::summon() {
    let a = f32x8::splat(token, 1.0);
    let b = f32x8::splat(token, 2.0);
    let c = a + b; // Natural operators!
}
```
For multiwidth code, use `magetypes::simd::*`:
```rust
use archmage::multiwidth;
#[multiwidth]
mod kernels {
    use magetypes::simd::*;

    pub fn sum(token: Token, data: &[f32]) -> f32 {
        let mut acc = f32xN::zero(token);
        // ...
    }
}
```
## Safe Memory Operations
Use `safe_unaligned_simd` directly inside `#[arcane]` functions:
```rust
use archmage::{Desktop64, SimdToken, arcane};
#[arcane]
fn process(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // safe_unaligned_simd calls are SAFE inside #[arcane]
    let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);
    let squared = _mm256_mul_ps(v, v);
    let mut out = [0.0f32; 8];
    safe_unaligned_simd::x86_64::_mm256_storeu_ps(&mut out, squared);
    out
}
```
## Pending Work
### API Parity Status (0 issues — complete!)
**Current state:** All W128 types have identical APIs across x86/ARM/WASM. Reduced from 270 → 0 parity issues (100%).
Run `cargo xtask parity` to verify.
### Known Cross-Architecture Behavioral Differences
These are documented semantic differences between architectures. Tests must account for them; they are not bugs to fix.
| Operation | x86 | ARM | WASM | Workaround |
|---|---|---|---|---|
| Bitwise operators (`&`, `\|`, `^`) on integers | Trait impls (operators work) | Methods only | Methods only | Use `.and()`, `.or()`, `.xor()` methods |
| `shr` for signed integers | Logical (zero-fill) | Arithmetic (sign-extend) | Arithmetic (sign-extend) | Use `shr_arithmetic` for portable sign-extending shift |
| `blend` signature | `(mask, true, false)` | `(mask, true, false)` | `(self, other, mask)` | Avoid in portable code; use bitcast + comparison verification |
| `interleave_lo/hi` | f32x4 only | f32x4 only | f32x4 only | Only use on f32x4, not integer types |
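For portable kernels, the safest pattern is to stick to the method forms listed in the workaround column. A minimal sketch; only the method names come from the table above, while the shift-amount parameter and exact signatures are assumptions:

```rust
use magetypes::simd::i32x4;

// Hypothetical portable helper: masks, then sign-extending-shifts, each lane.
// Using .and()/.shr_arithmetic() instead of the `&`/`>>` operators keeps
// x86, ARM, and WASM results identical.
fn mask_and_shift(v: i32x4, mask: i32x4) -> i32x4 {
    v.and(mask).shr_arithmetic(2)
}
```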
### Long-Term
- **Generator test fixtures**: Add example input/expected output pairs to each xtask generator (SIMD types, width dispatch, tokens, macro registry). These serve as both documentation of expected output and cross-platform regression tests — run on x86, ARM, and WASM to catch codegen divergence.
### Completed
- ~~**WASM u64x2 ordering comparisons**~~: Done. Added simd_lt/le/gt/ge via bias-to-signed polyfill (XOR with i64::MIN, then i64x2_lt/gt). Parity: 4 → 0.
- ~~**x86 byte shift polyfills**~~: Done. Added i8x16/u8x16 shl, shr, shr_arithmetic for all x86 widths. Uses 16-bit shift + byte mask (~2 instructions). AVX-512 shr_arithmetic uses mask registers. Parity: 9 → 4.
- ~~**All actionable parity issues**~~: Done. Closed 28 remaining issues: extend/pack ops (17), RGBA pixel ops (4), i64/u64 polyfill math (7). Parity: 37 → 9 (0 actionable).
- ~~**ARM/WASM block ops**~~: Done. ARM uses native vzip1q/vzip2q, WASM uses i32x4_shuffle. Both gained interleave_lo/hi, interleave, deinterleave_4ch, interleave_4ch, transpose_4x4, transpose_4x4_copy. Parity: 47 → 37.
- ~~**WASM cbrt + f64x2 log10_lowp**~~: Done. WASM f32x4 gained cbrt_midp/cbrt_midp_precise (scalar initial guess + Newton-Raphson). WASM f64x2 gained log10_lowp via scalar fallback.
- ~~**ARM transcendentals + x86 missing variants**~~: Done. ARM f32x4 has full lowp+midp transcendentals (log2, exp2, ln, exp, log10, pow, cbrt) with all variant coverage. ARM f64x2 has lowp transcendentals via scalar fallback. x86 gained lowp _unchecked aliases, midp _precise variants, and log10_midp family. Parity: 80 → 47.
- ~~**API surface parity detection tool**~~: Done. Use `cargo xtask parity` to detect API variances between x86/ARM/WASM.
- ~~**Move generated files to subfolder**~~: Done. All generated code now lives in `generated/` subfolders.
- ~~**Merge WASM transcendentals from `feat/wasm128`**~~: Done (354dc2b). All `_unchecked` and `_precise` variants now generated.
- ~~**ARM comparison ops**~~: Done. Added simd_eq, simd_ne, simd_lt, simd_le, simd_gt, simd_ge, blend.
- ~~**ARM bitwise ops**~~: Done. Added not, shl, shr for all integer types.
- ~~**ARM boolean reductions**~~: Done. Added all_true, any_true, bitmask for all integer types.
- ~~**x86 boolean reductions**~~: Done. Added all_true, any_true, bitmask for all integer types (128/256/512-bit).
- ~~**WASM bytemuck methods**~~: Done. Added cast_slice, cast_slice_mut, as_bytes, as_bytes_mut, from_bytes, from_bytes_owned.
- ~~**ARM reduce_add for unsigned**~~: Done. Extended reduce_add to all integer types including unsigned.
- ~~**Approximations (rcp, rsqrt) for ARM/WASM**~~: Done. ARM uses native vrecpe/vrsqrte, WASM uses division.
- ~~**mul_sub for ARM/WASM**~~: Done. ARM uses vfma with negation, WASM uses mul+sub.
- ~~**Type conversions for ARM/WASM**~~: Done. Added to_i32x4, to_i32x4_round, from_i32x4, to_f32x4, to_i32x4_low.
- ~~**shr_arithmetic for ARM/WASM**~~: Done. Added for i8x16, i16x8, i32x4.
## Suboptimal Intrinsics (need faster-path overloads)
Track places where we use polyfills or slower instruction sequences because the base token lacks a native intrinsic, but a higher token would have one. Each entry should get a method overload that accepts the higher token for the fast path.
| Operation | Tokens | Current approach | Faster token | Notes | Status |
|---|---|---|---|---|---|
| f32 cbrt initial guess | all tokens | scalar extract + bit hack | — | No SIMD cbrt exists; consider SIMD bit hack via integer ops | Low priority |
**Rules for this section:**
- Only add entries when you've verified the faster intrinsic exists and is correct.
- The overload should take the higher token as a parameter (e.g., `fn min_fast(self, other: Self, _: X64V4Token) -> Self`).
- Or use trait bounds: `fn min<T: HasX64V4>(self, other: Self, _: T) -> Self` for the fast path.
- Remove entries when the fast-path overload is implemented.
### Completed fast-path overloads
All i64/u64 min/max/abs now have `_fast` variants that take `X64V4Token`:
- `i64x2::min_fast`, `max_fast`, `abs_fast`
- `u64x2::min_fast`, `max_fast`
- `i64x4::min_fast`, `max_fast`, `abs_fast`
- `u64x4::min_fast`, `max_fast`
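A sketch of a call site for one of these variants; the fallback arm and the exact method signature are assumptions based on the overload rules above:

```rust
use archmage::{SimdToken, X64V4Token};
use magetypes::simd::i64x2;

// Hypothetical helper: take the native AVX-512 path only when the v4 token exists.
fn lane_min(a: i64x2, b: i64x2) -> i64x2 {
    match X64V4Token::summon() {
        Some(token) => a.min_fast(b, token), // native 64-bit min under AVX-512
        None => todo!("portable min on the base token"),
    }
}
```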
## License
MIT OR Apache-2.0