# archmage
> Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.
## CRITICAL: Naming Conventions
**Use the thematic names, not the boring ones:**
| Boring name | Thematic name | Notes |
|---|---|---|
| `#[simd_fn]` | `#[arcane]` | `simd_fn` exists only for migration |
| `try_new()` | `summon()` | `try_new` exists only for migration |
**We are mages, not bureaucrats.** Write `Token::summon()`, not `Token::try_new()`.
## Reference: CPU Features, Detection, and Dispatch
### The Core Distinction: Compile-Time vs Runtime
| Mechanism | Stage | What it does |
|---|---|---|
| `#[cfg(target_arch = "...")]` | Compile | Include/exclude code from binary |
| `#[cfg(target_feature = "...")]` | Compile | True only if feature is in target spec |
| `#[cfg(feature = "...")]` | Compile | Cargo feature flag |
| `-Ctarget-cpu=native` | Compile | LLVM assumes current CPU's features |
| `is_x86_feature_detected!()` | Runtime | CPUID instruction |
| `Token::summon()` | Runtime | Archmage's detection (compiles away when guaranteed) |
**Tokens exist everywhere.** `Desktop64`, `Arm64`, etc. compile on all platforms—`summon()` just returns `None` on unsupported architectures. This means **you rarely need `#[cfg(target_arch)]` guards** in user code. The stubs handle cross-compilation cleanly.
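For example, the following compiles unchanged on x86_64, ARM, and WASM with no `#[cfg]` guards; a minimal sketch reusing the `magetypes::simd::f32x8` API shown later in this document:
```rust
use archmage::{Desktop64, SimdToken, arcane};
use magetypes::simd::f32x8;

// Desktop64 exists on every platform; summon() just returns None where
// AVX2+FMA is unavailable, so the scalar branch runs instead.
pub fn sum_squares(data: &[f32; 8]) -> f32 {
    if let Some(token) = Desktop64::summon() {
        sum_squares_simd(token, data)
    } else {
        data.iter().map(|x| x * x).sum()
    }
}

#[arcane]
fn sum_squares_simd(token: Desktop64, data: &[f32; 8]) -> f32 {
    let v = f32x8::load(token, data);
    (v * v).reduce_add()
}
```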
### CRITICAL: Target-Feature Boundaries (4x Performance Impact)
**Enter `#[arcane]` once at the top, use `#[rite]` for everything inside.**
LLVM cannot inline across mismatched `#[target_feature]` attributes. Each `#[arcane]` call from non-SIMD code creates an optimization boundary — LLVM can't hoist loads, sink stores, or vectorize across it. This costs 4-6x depending on workload (see `benches/asm_inspection.rs` and `docs/PERFORMANCE.md`). Token hoisting doesn't help — even with the token pre-summoned, calling `#[arcane]` per iteration still hits the boundary.
```rust
// WRONG: #[arcane] boundary every iteration (4x slower)
#[arcane]
fn dist_simd(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> f32 { ... }
fn process_all(points: &[[f32; 8]]) {
    let token = X64V3Token::summon().unwrap(); // hoisted — doesn't help!
    for i in 0..points.len() {
        for j in i + 1..points.len() {
            dist_simd(token, &points[i], &points[j]); // boundary per call
        }
    }
}
```
```rust
// RIGHT: one #[arcane] entry, #[rite] helpers inline freely
fn process_all(points: &[[f32; 8]]) {
    if let Some(token) = X64V3Token::summon() {
        process_all_simd(token, points);
    } else {
        process_all_scalar(points);
    }
}

#[arcane]
fn process_all_simd(token: X64V3Token, points: &[[f32; 8]]) {
    for i in 0..points.len() {
        for j in i + 1..points.len() {
            dist_simd(token, &points[i], &points[j]); // #[rite] inlines here
        }
    }
}

#[rite]
fn dist_simd(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    // Inlines into process_all_simd — same LLVM optimization region
    ...
}
```
**The rule:** `#[arcane]` at the entry point, `#[rite]` for everything called from SIMD code.
### CRITICAL: Generic Bounds Are Optimization Barriers
**Generic passthrough with trait bounds breaks inlining.** The compiler cannot inline across generic boundaries — each trait-bounded call is a potential indirect call.
```rust
// BAD: Generic bound prevents inlining into caller
#[arcane]
fn process_generic<T: HasX64V2>(token: T, data: &[f32]) -> f32 {
    inner_work(token, data) // Can't inline — T could be any type
}

#[arcane]
fn inner_work<T: HasX64V2>(token: T, data: &[f32]) -> f32 {
    // Even with #[inline(always)], this may not inline through generic
    ...
}
```
```rust
// GOOD: Concrete token enables full inlining
#[arcane]
fn process_concrete(token: X64V3Token, data: &[f32]) -> f32 {
    inner_work(token, data) // Fully inlinable — concrete type
}

#[arcane]
fn inner_work(token: X64V3Token, data: &[f32]) -> f32 {
    // Inlines into caller, single #[target_feature] region
    ...
}
```
**Why this matters:**
- `#[target_feature]` functions inline to share the feature-enabled region
- Generic bounds break this chain — each function is a separate compilation unit
- Even `#[inline(always)]` can't force inlining across trait object boundaries
**Downcasting is free:** Pass a higher token to a function expecting a lower one. Nested `#[arcane]` with downcasting preserves the inlining chain:
```rust
#[arcane]
fn v4_kernel(token: X64V4Token, data: &mut [f32]) {
    // Can call V3 functions — V4 is a superset
    let partial = v3_helper(token, &data[..8]); // Downcasts, still inlines
    // ... AVX-512 specific work ...
}

#[arcane]
fn v3_helper(token: X64V3Token, chunk: &[f32]) -> f32 {
    // AVX2+FMA work — inlines into v4_kernel
    ...
}
```
**Upcasting via `IntoConcreteToken`:** Safe, but creates an LLVM optimization boundary:
```rust
fn process<T: IntoConcreteToken>(token: T, data: &mut [f32]) {
    // Generic caller has baseline LLVM target
    if let Some(v4) = token.as_x64v4() {
        process_v4(v4, data); // Callee has AVX-512 target — mismatched
    } else if let Some(v3) = token.as_x64v3() {
        process_v3(v3, data);
    }
}
```
The issue: `#[target_feature]` changes LLVM's target for that function. Generic caller and feature-enabled callee have mismatched targets, so LLVM can't optimize across that boundary. Do dispatch once at entry, not deep in hot code.
**The rule:** Use concrete tokens for hot paths. Downcasting (V4→V3) is free. Upcasting via `IntoConcreteToken` is safe but creates optimization boundaries.
### When `-Ctarget-cpu=native` Is Fine
**Use it when:** Building for your own machine or known deployment target.
```bash
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release
```
**Detection compiles away:** With `-Ctarget-cpu=haswell`, `X64V3Token::compiled_with()` returns `Some(true)` and `summon()` becomes a no-op. The compiler elides the check entirely.
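A small illustrative check (the generated `compiled_with()` returns either `Some(true)` or `None`, as the codegen example further down shows):
```rust
use archmage::{SimdToken, X64V3Token};

fn report() {
    // With -Ctarget-cpu=haswell (or newer) this is Some(true) and
    // X64V3Token::summon() compiles down to a constant; otherwise it is
    // None and summon() falls back to (cached) runtime detection.
    if X64V3Token::compiled_with() == Some(true) {
        println!("AVX2+FMA guaranteed by the compile target");
    }
}
```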
**Don't use it when:** Distributing binaries to unknown CPUs.
### How `#[cfg(target_feature)]` Actually Works
```rust
// TRUE only if compiled with -Ctarget-cpu=haswell or -Ctarget-feature=+avx2
#[cfg(target_feature = "avx2")]
fn only_with_avx2_target() { }
// ALWAYS true on x86_64 (baseline)
#[cfg(target_feature = "sse2")]
fn always_on_x86_64() { }
```
Default `x86_64-unknown-linux-gnu` only enables SSE/SSE2. Extended features require `-Ctarget-cpu` or `-Ctarget-feature`.
### The Cargo Feature Trap
**WRONG:** Gating type aliases on cargo features:
```rust
// BAD: Types don't exist unless cargo feature enabled!
#[cfg(all(target_arch = "x86_64", feature = "avx512"))]
pub use crate::simd::x86::w512::f32x16 as F32Vec;
```
This breaks runtime dispatch — types aren't available even if CPU supports AVX-512.
**Cargo features should control:**
- Whether to *attempt* higher tiers at runtime
- Compile-time-only paths for known targets
**Cargo features should NOT control:**
- Whether SIMD types exist
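A hedged sketch of the intended split, using a hypothetical downstream `avx512` cargo feature and placeholder kernel bodies. The feature gates only whether the higher tier is attempted, never whether types or functions exist:
```rust
use archmage::{SimdToken, X64V3Token, arcane};

#[cfg(feature = "avx512")]
use archmage::X64V4Token;

#[cfg(feature = "avx512")]
#[arcane]
fn sum_v4(_token: X64V4Token, data: &[f32]) -> f32 {
    data.iter().sum() // placeholder: AVX-512 kernel goes here
}

#[arcane]
fn sum_v3(_token: X64V3Token, data: &[f32]) -> f32 {
    data.iter().sum() // placeholder: AVX2+FMA kernel goes here
}

// The cargo feature controls whether the v4 tier is *attempted*...
#[cfg(feature = "avx512")]
fn try_v4(data: &[f32]) -> Option<f32> {
    X64V4Token::summon().map(|token| sum_v4(token, data))
}

#[cfg(not(feature = "avx512"))]
fn try_v4(_data: &[f32]) -> Option<f32> {
    None
}

pub fn sum(data: &[f32]) -> f32 {
    if let Some(result) = try_v4(data) {
        return result;
    }
    // ...while the v3 and scalar paths always exist, regardless of features.
    if let Some(token) = X64V3Token::summon() {
        return sum_v3(token, data);
    }
    data.iter().sum()
}
```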
### `#[arcane]`: Cross-Arch Compilation
On the wrong architecture, `#[arcane]` generates an unreachable stub:
```rust
// On ARM: stub that compiles but can't be reached
#[cfg(not(target_arch = "x86_64"))]
fn process(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    unreachable!("X64V3Token cannot exist on this architecture")
}
```
### `#[arcane]` with Methods
Use `_self = Type` and reference `_self` in body:
```rust
impl Processor {
    #[arcane(_self = Processor)]
    fn process(&self, token: X64V3Token, data: &[f32; 8]) -> f32 {
        _self.threshold // Use _self, not self
    }
}
```
### `incant!`: Dispatch Macro
```rust
use archmage::incant;
pub fn sum(data: &[f32]) -> f32 {
    incant!(sum(data))
}
// Requires suffixed functions:
// sum_v3(token: X64V3Token, ...)
// sum_v4(token: X64V4Token, ...) // if feature = "avx512"
// sum_neon(token: NeonToken, ...)
// sum_wasm128(token: Wasm128Token, ...)
// sum_scalar(token: ScalarToken, ...)
```
**Passthrough mode** (already have token):
```rust
fn inner<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 {
    incant!(with token => process(data))
}
```
### `ScalarToken`
Always-available fallback. Used for:
- `incant!()` convention (`_scalar` suffix)
- Consistent API shape in dispatch
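A hedged sketch of two of the suffixed functions behind `incant!(sum(data))`. Only the `_v3` and `_scalar` variants are shown, with placeholder bodies; the remaining `_neon`/`_wasm128` variants follow the same shape:
```rust
use archmage::{ScalarToken, X64V3Token, arcane};

#[arcane]
fn sum_v3(_token: X64V3Token, data: &[f32]) -> f32 {
    data.iter().sum() // placeholder: AVX2+FMA implementation goes here
}

// Always-available fallback chosen when no SIMD tier can be summoned.
fn sum_scalar(_token: ScalarToken, data: &[f32]) -> f32 {
    data.iter().sum()
}
```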
### Fixed-Size Types with Polyfills
**Pick a concrete size. Use polyfills for portability.**
```rust
use magetypes::simd::f32x8; // Always 8 lanes, polyfilled on ARM/WASM
#[arcane]
fn process(token: X64V3Token, data: &[f32; 8]) -> f32 {
    let v = f32x8::load(token, data);
    v.reduce_add()
}
```
On ARM, `f32x8` is emulated with two `f32x4` operations. The API is identical.
---
## CRITICAL: Token/Trait Design (DO NOT MODIFY)
### LLVM x86-64 Microarchitecture Levels
| Level | Features | Token | Generic bound |
|---|---|---|---|
| **v1** | SSE, SSE2 (baseline) | None | None (always available) |
| **v2** | + SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT | `X64V2Token` | `HasX64V2` |
| **v3** | + AVX, AVX2, FMA, BMI1, BMI2, F16C | `X64V3Token` / `Desktop64` | Use token directly |
| **v4** | + AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL | `X64V4Token` / `Avx512Token` | `HasX64V4` |
| **Modern** | + VPOPCNTDQ, IFMA, VBMI, VNNI, BF16, VBMI2, BITALG, VPCLMULQDQ, GFNI, VAES | `Avx512ModernToken` | Use token directly |
| **FP16** | AVX512FP16 (independent) | `Avx512Fp16Token` | Use token directly |
### AArch64 Tokens
| Token | Features | Generic bound |
|---|---|---|
| `NeonToken` / `Arm64` | neon | `HasNeon` |
| `NeonAesToken` | + aes | `HasNeonAes` |
| `NeonSha3Token` | + sha3 | `HasNeonSha3` |
| `NeonCrcToken` | + crc | Use token directly |
**PROHIBITED:** NO SVE/SVE2 - Rust stable doesn't support SVE intrinsics yet.
### Rules
1. **NO granular x86 traits** - No `HasSse`, `HasSse2`, `HasAvx`, `HasAvx2`, `HasFma`, `HasAvx512f`, `HasAvx512bw`, etc.
2. **Use tier tokens** - `X64V2Token`, `X64V3Token`, `X64V4Token`, `Avx512ModernToken`
3. **Single trait per tier** - `HasX64V2`, `HasX64V4` only
4. **NEON includes fp16** - They always appear together on AArch64
5. **NO SVE** - `SveToken`, `Sve2Token`, `HasSve`, `HasSve2` are PROHIBITED (Rust stable lacks SVE support)
6. **NO WIDTH TRAITS** - `Has128BitSimd`, `Has256BitSimd`, `Has512BitSimd` are DEPRECATED and will be removed:
- `Has256BitSimd` only enables AVX, **NOT AVX2 or FMA** — misleading and causes suboptimal codegen
- Use concrete tokens (`X64V3Token`) or feature traits (`HasX64V2`, `HasX64V4`) instead
---
## CRITICAL: Documentation Examples
### Prefer `#[rite]` for internal code, `#[arcane]` only at entry points
**`#[rite]` should be the default.** It adds `#[target_feature]` + `#[inline]` — LLVM inlines it into any caller with matching features.
Use `#[arcane]` only when the function is called from non-SIMD code:
- After `summon()` in a public API
- From tests
- From non-`#[target_feature]` contexts
```rust
// Entry point (called after summon) - use #[arcane]
#[arcane]
pub fn process(token: Desktop64, data: &mut [f32]) {
    for chunk in data.chunks_exact_mut(8) {
        // chunks_exact_mut yields &mut [f32]; the array conversion is free
        process_chunk(token, chunk.try_into().unwrap()); // Calls #[rite]
    }
}

// Internal helper (called from #[arcane] or #[rite]) - use #[rite]
#[rite]
fn process_chunk(_: Desktop64, chunk: &mut [f32; 8]) {
    // ...
}
```
### Never use manual `#[target_feature]`
**DO NOT write examples with manual `#[target_feature]` + unsafe wrappers.** Use `#[arcane]` or `#[rite]` instead.
```rust
// WRONG - manual #[target_feature] wrapping
#[cfg(target_arch = "x86_64")]
#[inline]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn process_inner(data: &[f32]) -> f32 { ... }
#[cfg(target_arch = "x86_64")]
fn process(token: X64V3Token, data: &[f32]) -> f32 {
    unsafe { process_inner(data) }
}

// CORRECT - use #[arcane] (generates #[target_feature] + stubs on other arches)
#[arcane]
fn process(token: X64V3Token, data: &[f32]) -> f32 {
    // This function body is compiled with #[target_feature(enable = "avx2,fma")]
    // Intrinsics and operators inline properly into single SIMD instructions
    ...
}
```
### Use `safe_unaligned_simd` inside `#[arcane]` functions
**Use `safe_unaligned_simd` directly inside `#[arcane]` functions.** The calls are safe because the target features match.
```rust
// WRONG - raw pointers need unsafe
let v = unsafe { _mm256_loadu_ps(data.as_ptr()) };
// CORRECT - use safe_unaligned_simd (safe inside #[arcane])
let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);
```
## Quick Start
```bash
cargo test # Run tests
cargo test --all-features # Test with all integrations
cargo clippy --all-features # Lint
just generate # Regenerate all generated code
just validate-registry # Validate token-registry.toml
just validate-tokens # Validate magetypes safety + summon() checks
just parity # Check API parity across x86/ARM/WASM
just soundness # Static intrinsic soundness verification
just miri # Run magetypes under Miri (detects UB)
just audit # Scan for safety-critical code
just intrinsics-refresh # Re-extract intrinsics from current Rust
just ci # Run ALL checks (must pass before push/publish)
```
## CI and Publishing Rules
**ABSOLUTE REQUIREMENT: Run `just ci` (or `just all` or `cargo xtask all`) before ANY push or publish.**
```bash
just ci # or: just all, cargo xtask ci, cargo xtask all
```
**NEVER run `git push` or `cargo publish` until this passes. No exceptions.**
CI checks (all must pass):
1. `cargo xtask generate` — regenerate all code
2. **Clean worktree check** — no uncommitted changes after generation (HARD FAIL)
3. `cargo xtask validate` — intrinsic safety + summon() feature verification
4. `cargo xtask parity` — parity check (0 issues remaining)
5. `cargo clippy --features "std macros avx512"` — zero warnings
6. `cargo test --features "std macros avx512"` — all tests pass
7. `cargo fmt --check` — code is formatted
**Note:** Parity check reports 0 issues. All W128 types have identical APIs across x86/ARM/WASM.
If ANY check fails:
- Do NOT push
- Do NOT publish
- Fix the issue first
- Re-run `just ci` until it passes
**Git tags are MANDATORY for every publish.** After `cargo publish`, immediately create tags:
```bash
git tag v{version} # archmage
git tag archmage-macros-v{version} # archmage-macros
git tag magetypes-v{version} # magetypes
git push origin v{version} archmage-macros-v{version} magetypes-v{version}
```
Publish order (respect dependency chain): `archmage-macros` → `archmage` → `magetypes`.
## Source of Truth: token-registry.toml
All token definitions, feature sets, trait mappings, and width configurations
live in `token-registry.toml`. Everything else is derived:
- `src/tokens/generated/` — token structs, traits, stubs, generated by xtask
- `archmage-macros/src/generated/` — macro registry, generated by xtask
- `magetypes/src/simd/generated/` — SIMD types, generated by xtask
- `docs/generated/` — intrinsics reference docs, generated by xtask
- `xtask/src/main.rs` validation — reads registry at runtime
- `cargo xtask validate` — verifies summon() checks match registry
- `cargo xtask parity` — checks API parity across architectures
To add/modify tokens: edit `token-registry.toml`, then `just generate`.
## Core Insight: Rust 1.85+ Changed Everything
As of Rust 1.85, **value-based intrinsics are safe inside `#[target_feature]` functions**:
```rust
#[target_feature(enable = "avx2")]
unsafe fn example() {
    let a = _mm256_setzero_ps();      // SAFE!
    let b = _mm256_add_ps(a, a);      // SAFE!
    let c = _mm256_fmadd_ps(a, a, a); // SAFE!
    // Only memory ops remain unsafe (raw pointers)
    let v = unsafe { _mm256_loadu_ps(ptr) }; // Still needs unsafe
}
```
This means we **don't need to wrap** arithmetic, shuffle, compare, bitwise, or other value-based intrinsics. Only:
1. **Tokens** - Prove CPU features are available
2. **`#[arcane]` macro** - Enable `#[target_feature]` via token proof
3. **`safe_unaligned_simd`** - Reference-based memory operations (user adds as dependency)
## How `#[arcane]` Works
The macro generates an inner function with `#[target_feature]`:
```rust
// You write:
#[arcane]
fn my_kernel(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    let v = _mm256_setzero_ps(); // Safe!
    // ...
}

// Macro generates:
fn my_kernel(token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    #[target_feature(enable = "avx2,fma")]
    fn inner(data: &[f32; 8]) -> [f32; 8] {
        let v = _mm256_setzero_ps(); // Safe inside #[target_feature]!
        // ...
    }
    // SAFETY: Calling #[target_feature] fn from non-matching context.
    // Token proves CPU support was verified via summon().
    unsafe { inner(data) }
}
```
## Friendly Aliases
| Alias | Canonical token | Hardware |
|---|---|---|
| `Desktop64` | `X64V3Token` | AVX2 + FMA (Haswell 2013+, Zen 1+) |
| `Server64` | `X64V4Token` | + AVX-512 (Xeon 2017+, Zen 4+) |
| `Arm64` | `NeonToken` | NEON + FP16 (all 64-bit ARM) |
```rust
use archmage::{Desktop64, SimdToken, arcane};
#[arcane]
fn process(token: Desktop64, data: &mut [f32; 8]) {
    // AVX2 + FMA intrinsics safe here
}

if let Some(token) = Desktop64::summon() {
    process(token, &mut data);
}
```
## Directory Structure
```
token-registry.toml      # THE source of truth for all token/trait/feature data
spec.md                  # Architecture spec and safety model documentation
archmage/                # Main crate: tokens, macros, detect
├── src/
│   ├── lib.rs           # Main exports
│   ├── tokens/          # SIMD capability tokens
│   │   ├── mod.rs       # SimdToken trait definition only
│   │   └── generated/   # Generated from token-registry.toml
│   │       ├── mod.rs          # cfg-gated module routing + re-exports
│   │       ├── traits.rs       # Marker traits (Has128BitSimd, HasX64V2, etc.)
│   │       ├── x86.rs          # x86 tokens (v2, v3) + detection
│   │       ├── x86_avx512.rs   # AVX-512 tokens (v4, modern, fp16)
│   │       ├── arm.rs          # ARM tokens + detection
│   │       ├── wasm.rs         # WASM tokens + detection
│   │       ├── x86_stubs.rs    # x86 stubs (summon → None)
│   │       ├── arm_stubs.rs    # ARM stubs
│   │       └── wasm_stubs.rs   # WASM stubs
archmage-macros/         # Proc-macro crate (#[arcane], #[rite], #[magetypes], incant!)
└── src/
    ├── lib.rs           # Macro implementation
    └── generated/       # Generated from token-registry.toml
        ├── mod.rs       # Re-exports
        └── registry.rs  # Token→features mappings
magetypes/               # SIMD types crate (depends on archmage)
├── src/
│   ├── lib.rs           # Exports simd module
│   └── simd/
│       ├── mod.rs       # Re-exports from generated/
│       └── generated/   # Auto-generated SIMD types
│           ├── x86/         # x86-64 types (w128, w256, w512)
│           ├── arm/         # AArch64 types (w128)
│           ├── wasm/        # WASM types (w128)
│           └── polyfill.rs  # Width emulation
docs/
└── generated/           # Auto-generated reference docs
    ├── x86-intrinsics-by-token.md
    ├── aarch64-intrinsics-by-token.md
    └── memory-ops-reference.md
xtask/                   # Code generator and validation
└── src/
    ├── main.rs          # Generates everything, validates safety, parity check
    ├── registry.rs      # token-registry.toml parser
    └── token_gen.rs     # Token/trait code generator
```
## CRITICAL: Codegen Style Rules — NO `writeln!` CHAINS
**THIS IS MANDATORY. ALL codegen MUST use `formatdoc!` from the `indoc` crate.**
`writeln!` chains are the single biggest readability problem in our codegen. They turn 10 lines of readable Rust into 40 lines of string-escaping noise where you can't see the generated code's structure. Every `{{` and `}}` is a bug waiting to happen. Every `.unwrap()` is visual clutter. Stop it.
### The rule
1. **Use `formatdoc!` with raw strings** for any block of generated code (2+ lines)
2. **Use `format!` or string literals** for single-line fragments only
3. **`writeln!` is BANNED** except for trivial single-line output to stdout/stderr (like progress messages)
4. **`indoc` is already a dependency** of xtask — there is zero excuse
### Pattern: `formatdoc!` with `push_str`
```rust
use indoc::formatdoc;
// WRONG — unreadable writeln! soup (actual current state of token_gen.rs)
writeln!(code, "impl SimdToken for {name} {{").unwrap();
writeln!(code, " const NAME: &'static str = \"{display_name}\";").unwrap();
writeln!(code, "").unwrap();
writeln!(code, " fn compiled_with() -> Option<bool> {{").unwrap();
writeln!(code, " #[cfg(all({cfg_all}))]").unwrap();
writeln!(code, " {{ Some(true) }}").unwrap();
writeln!(code, " #[cfg(not(all({cfg_all})))]").unwrap();
writeln!(code, " {{ None }}").unwrap();
writeln!(code, " }}").unwrap();
writeln!(code, "}}").unwrap();
// CORRECT — you can actually READ the generated code
code.push_str(&formatdoc! {r#"
impl SimdToken for {name} {{
const NAME: &'static str = "{display_name}";
fn compiled_with() -> Option<bool> {{
#[cfg(all({cfg_all}))]
{{ Some(true) }}
#[cfg(not(all({cfg_all})))]
{{ None }}
}}
}}
"#});
```
### Pattern: conditional blocks
```rust
// Build a section conditionally, then splice it in
let cascade_code = if !descendants.is_empty() {
    let mut lines = String::new();
    for desc in &descendants {
        lines.push_str(&formatdoc! {r#"
            super::{module}::{cache}.store(v, Ordering::Relaxed);
            super::{module}::{disabled}.store(disabled, Ordering::Relaxed);
        "#, module = desc.module, cache = desc.cache_var, disabled = desc.disabled_var});
    }
    lines
} else {
    String::new()
};

code.push_str(&formatdoc! {r#"
    pub fn disable(disabled: bool) {{
        CACHE.store(if disabled {{ 1 }} else {{ 0 }}, Ordering::Relaxed);
        {cascade_code}
    }}
"#});
```
### Pattern: helper functions that return String
```rust
fn gen_compiled_with(name: &str, cfg_all: &str) -> String {
formatdoc! {r#"
fn compiled_with() -> Option<bool> {{
#[cfg(all({cfg_all}))]
{{ Some(true) }}
#[cfg(not(all({cfg_all})))]
{{ None }}
}}
"#}
}
```
### For magetypes method generation
Use the helpers in `xtask/src/simd_types/types.rs`:
```rust
use super::types::{gen_unary_method, gen_binary_method, gen_scalar_method};
code.push_str(&gen_unary_method("Compute absolute value", "abs", "Self(_mm256_abs_epi32(self.0))"));
code.push_str(&gen_binary_method("Add two vectors", "add", "Self(_mm256_add_epi32(self.0, other.0))"));
code.push_str(&gen_scalar_method("Extract first element", "first", "i32", "_mm_cvtsi128_si32(self.0)"));
```
### Enforcement
When touching ANY codegen file, convert `writeln!` chains to `formatdoc!` in the same commit. Don't add new `writeln!` chains. Existing `writeln!` chains in `token_gen.rs` (297 occurrences!) and `main.rs` (33 occurrences) are tech debt — convert them progressively.
## Token Hierarchy
**x86:**
- `X64V2Token` - SSE4.2 + POPCNT (Nehalem 2008+)
- `X64V3Token` / `Desktop64` - AVX2 + FMA + BMI2 (Haswell 2013+, Zen 1+)
- `X64V4Token` / `Avx512Token` - + AVX-512 F/BW/CD/DQ/VL (Skylake-X 2017+, Zen 4+)
- `Avx512ModernToken` - + modern extensions (Ice Lake 2019+, Zen 4+)
- `Avx512Fp16Token` - + FP16 (Sapphire Rapids 2023+)
**ARM:**
- `NeonToken` / `Arm64` - NEON (virtually all AArch64, requires runtime detection)
- `NeonAesToken` - + AES
- `NeonSha3Token` - + SHA3
- `NeonCrcToken` - + CRC
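A hand-rolled cascade over this hierarchy looks like the sketch below (placeholder kernel bodies); `incant!` generates essentially this dispatch for you:
```rust
use archmage::{SimdToken, X64V2Token, X64V3Token, arcane};

#[arcane]
fn kernel_v3(_token: X64V3Token, data: &[f32]) -> f32 {
    data.iter().sum() // placeholder: AVX2+FMA code goes here
}

#[arcane]
fn kernel_v2(_token: X64V2Token, data: &[f32]) -> f32 {
    data.iter().sum() // placeholder: SSE4.2 code goes here
}

// Probe from the highest tier you implement down to scalar.
pub fn dispatch(data: &[f32]) -> f32 {
    if let Some(token) = X64V3Token::summon() {
        return kernel_v3(token, data);
    }
    if let Some(token) = X64V2Token::summon() {
        return kernel_v2(token, data);
    }
    data.iter().sum() // scalar fallback
}
```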
## Tier Traits
Only two x86 tier traits exist for generic bounds (`HasX64V2`, `HasX64V4`); ARM adds `HasNeon` and friends:
```rust
fn requires_v2(token: impl HasX64V2) { ... }
fn requires_v4(token: impl HasX64V4) { ... }
fn requires_neon(token: impl HasNeon) { ... }
```
For v3 (AVX2+FMA), use `X64V3Token` directly - it's the recommended baseline.
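A small usage sketch, assuming the tier traits are re-exported at the crate root like the tokens and that higher-tier tokens satisfy the lower-tier traits (as the generic examples above imply):
```rust
use archmage::{HasX64V2, SimdToken, X64V2Token, X64V3Token};

fn requires_v2(token: impl HasX64V2) -> u32 {
    let _ = token; // SSE4.2-level work (POPCNT etc.) would go here
    0
}

fn entry() -> u32 {
    // Either token satisfies the bound; v3 is a strict superset of v2.
    if let Some(t) = X64V3Token::summon() {
        requires_v2(t)
    } else if let Some(t) = X64V2Token::summon() {
        requires_v2(t)
    } else {
        0 // scalar fallback
    }
}
```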
## SIMD Types (magetypes crate)
Token-gated SIMD types live in the **magetypes** crate. **Use fixed-size types:**
```rust
use archmage::{X64V3Token, SimdToken, arcane};
use magetypes::simd::f32x8;
pub fn process(data: &[f32; 8]) -> f32 {
    if let Some(token) = X64V3Token::summon() {
        process_simd(token, data)
    } else {
        data.iter().sum()
    }
}

#[arcane]
fn process_simd(token: X64V3Token, data: &[f32; 8]) -> f32 {
    let a = f32x8::load(token, data);
    let b = f32x8::splat(token, 2.0);
    let c = a * b;
    c.reduce_add()
}
```
On ARM/WASM, `f32x8` is polyfilled with two `f32x4` operations. Pick the size that fits your algorithm.
## Safe Memory Operations
Use `safe_unaligned_simd` directly inside `#[arcane]` functions:
```rust
use archmage::{Desktop64, SimdToken, arcane};
#[arcane]
fn process(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // safe_unaligned_simd calls are SAFE inside #[arcane]
    let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);
    let squared = _mm256_mul_ps(v, v);
    let mut out = [0.0f32; 8];
    safe_unaligned_simd::x86_64::_mm256_storeu_ps(&mut out, squared);
    out
}
```
## Pending Work
### API Parity Status (0 issues — complete!)
**Current state:** All W128 types have identical APIs across x86/ARM/WASM. Reduced from 270 → 0 parity issues (100%).
Run `cargo xtask parity` to verify.
### Known Cross-Architecture Behavioral Differences
These are documented semantic differences between architectures. Tests must account for them; they are not bugs to fix.
| Behavior | x86 | ARM | WASM | Portable approach |
|---|---|---|---|---|
| Bitwise operators (`&`, `\|`, `^`) on integers | Trait impls (operators work) | Methods only | Methods only | Use `.and()`, `.or()`, `.xor()` methods |
| `shr` for signed integers | Logical (zero-fill) | Arithmetic (sign-extend) | Arithmetic (sign-extend) | Use `shr_arithmetic` for portable sign-extending shift |
| `blend` signature | `(mask, true, false)` | `(mask, true, false)` | `(self, other, mask)` | Avoid in portable code; use bitcast + comparison verification |
| `interleave_lo/hi` | f32x4 only | f32x4 only | f32x4 only | Only use on f32x4, not integer types |
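A hedged sketch using the portable choices from the table: the method names are as listed above, while the `i32x4::load` constructor and the shift-amount parameter are assumed to mirror the `f32x8` examples elsewhere in this document.
```rust
use archmage::{Desktop64, SimdToken, arcane};
use magetypes::simd::i32x4;

#[arcane]
fn portable_mask_and_shift(token: Desktop64, a: &[i32; 4], b: &[i32; 4]) -> i32x4 {
    let va = i32x4::load(token, a);
    let vb = i32x4::load(token, b);
    let masked = va.and(vb);   // method form works on x86, ARM, and WASM
    masked.shr_arithmetic(2)   // sign-extending shift on every architecture
}
```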
### Long-Term
- **Generator test fixtures**: Add example input/expected output pairs to each xtask generator (SIMD types, width dispatch, tokens, macro registry). These serve as both documentation of expected output and cross-platform regression tests — run on x86, ARM, and WASM to catch codegen divergence.
- ~~**Target-feature boundary overhead benchmark**~~: Done. See `benches/asm_inspection.rs` and `docs/PERFORMANCE.md`. Key results:
- Simple vector add (1000 x 8-float): `#[rite]` in `#[arcane]` 547 ns, `#[arcane]` per iteration 2209 ns (4x), bare `#[target_feature]` 2222 ns (4x, identical)
- DCT-8 (100 rows x 8 dot products): `#[rite]` in `#[arcane]` 61 ns, `#[arcane]` per row 376 ns (6.2x), bare `#[target_feature]` 374 ns (6.2x, identical)
- Cross-token nesting: downgrade (V4→V3, V3→V2) is free, upgrade (V2→V3, V3→V4) costs 4x, all patterns match bare `#[target_feature]`
Key insight: the overhead is from the `#[target_feature]` optimization boundary, NOT from wrappers or archmage abstractions. The cost scales with computational density (4x simple add, 6.2x DCT-8). Feature direction matters: downgrades are free (superset enables inlining), upgrades hit the boundary.
- ~~**summon() caching**~~: **Implemented!** See `benches/summon_overhead.rs`. Results after adding atomic caching:
- `Desktop64::summon()` (cached): ~1.3 ns (was 2.6 ns — **2x faster**)
- `Avx512ModernToken::summon()` (cached): ~1.3 ns (was 7.2 ns — **6x faster**)
- With `-Ctarget-cpu=haswell`: 0 ns (compiles away entirely)
Implementation: Each token has a static `AtomicU8` cache (0=unknown, 1=unavailable, 2=available). Compile-time `#[cfg(target_feature)]` guard skips the cache entirely when features are guaranteed.
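A minimal sketch of that caching scheme (illustrative only; the real generated code lives in `src/tokens/generated/` and this version is x86-specific):
```rust
use std::sync::atomic::{AtomicU8, Ordering};

// 0 = unknown, 1 = unavailable, 2 = available
static CACHE: AtomicU8 = AtomicU8::new(0);

fn avx2_fma_available() -> bool {
    // In the real code a #[cfg(target_feature = "avx2")] guard returns true
    // here without ever touching the cache.
    match CACHE.load(Ordering::Relaxed) {
        2 => true,
        1 => false,
        _ => {
            let ok = std::arch::is_x86_feature_detected!("avx2")
                && std::arch::is_x86_feature_detected!("fma");
            CACHE.store(if ok { 2 } else { 1 }, Ordering::Relaxed);
            ok
        }
    }
}
```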
### safe_unaligned_simd Gaps (discovered in rav1d-safe refactoring)
Found during pal.rs refactoring to use `#[arcane]` + `safe_unaligned_simd`:
- **SOLVED: Created `partial_simd` module in rav1d-safe** with `Is64BitsUnaligned` trait:
```rust
#[target_feature(enable = "sse2")]
pub fn mm_loadl_epi64<T: Is64BitsUnaligned>(src: &T) -> __m128i {
    unsafe { _mm_loadl_epi64(ptr::from_ref(src).cast()) }
}
```
- Generates identical `vmovq` instructions (zero overhead)
- Pattern ready for upstream to safe_unaligned_simd
- **Verified: No overhead from slice-to-array conversion**
- `slice[..32].try_into().unwrap()` optimizes away completely
- `safe_simd::_mm256_loadu_si256(arr)` → same `vmovdqu` as raw pointer
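For example (a sketch adapting the pattern to the f32 load shown elsewhere in this document; the chunk size is illustrative):
```rust
use archmage::{Desktop64, arcane};

#[arcane]
fn first_block(_token: Desktop64, data: &[f32]) -> [f32; 8] {
    // The bounds check and try_into compile away; the load is a single vmovups.
    let arr: &[f32; 8] = data[..8].try_into().unwrap();
    let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(arr);
    let mut out = [0.0f32; 8];
    safe_unaligned_simd::x86_64::_mm256_storeu_ps(&mut out, v);
    out
}
```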
### Completed
- ~~**Type implementation verification**~~: Done. Added `implementation_name() -> &'static str` to all magetypes vectors. Uses tier-based naming: `"x86::v3::f32x8"`, `"x86::v4::f32x16"`, `"arm::neon::f32x4"`, `"wasm::wasm128::f32x4"`, `"polyfill::v3::f32x8"`, `"polyfill::v3_512::f32x16"`, `"polyfill::neon::f32x8"`. Test in `tests/exhaustive_intrinsics.rs`.
- ~~**WASM u64x2 ordering comparisons**~~: Done. Added simd_lt/le/gt/ge via bias-to-signed polyfill (XOR with i64::MIN, then i64x2_lt/gt). Parity: 4 → 0.
- ~~**x86 byte shift polyfills**~~: Done. Added i8x16/u8x16 shl, shr, shr_arithmetic for all x86 widths. Uses 16-bit shift + byte mask (~2 instructions). AVX-512 shr_arithmetic uses mask registers. Parity: 9 → 4.
- ~~**All actionable parity issues**~~: Done. Closed 28 remaining issues: extend/pack ops (17), RGBA pixel ops (4), i64/u64 polyfill math (7). Parity: 37 → 9 (0 actionable).
- ~~**ARM/WASM block ops**~~: Done. ARM uses native vzip1q/vzip2q, WASM uses i32x4_shuffle. Both gained interleave_lo/hi, interleave, deinterleave_4ch, interleave_4ch, transpose_4x4, transpose_4x4_copy. Parity: 47 → 37.
- ~~**WASM cbrt + f64x2 log10_lowp**~~: Done. WASM f32x4 gained cbrt_midp/cbrt_midp_precise (scalar initial guess + Newton-Raphson). WASM f64x2 gained log10_lowp via scalar fallback.
- ~~**ARM transcendentals + x86 missing variants**~~: Done. ARM f32x4 has full lowp+midp transcendentals (log2, exp2, ln, exp, log10, pow, cbrt) with all variant coverage. ARM f64x2 has lowp transcendentals via scalar fallback. x86 gained lowp _unchecked aliases, midp _precise variants, and log10_midp family. Parity: 80 → 47.
- ~~**API surface parity detection tool**~~: Done. Use `cargo xtask parity` to detect API variances between x86/ARM/WASM.
- ~~**Move generated files to subfolder**~~: Done. All generated code now lives in `generated/` subfolders.
- ~~**Merge WASM transcendentals from `feat/wasm128`**~~: Done (354dc2b). All `_unchecked` and `_precise` variants now generated.
- ~~**ARM comparison ops**~~: Done. Added simd_eq, simd_ne, simd_lt, simd_le, simd_gt, simd_ge, blend.
- ~~**ARM bitwise ops**~~: Done. Added not, shl, shr for all integer types.
- ~~**ARM boolean reductions**~~: Done. Added all_true, any_true, bitmask for all integer types.
- ~~**x86 boolean reductions**~~: Done. Added all_true, any_true, bitmask for all integer types (128/256/512-bit).
- ~~**WASM token-gated casting methods**~~: Done. Added cast_slice, cast_slice_mut, as_bytes, as_bytes_mut, from_bytes, from_bytes_owned (token-gated replacements for bytemuck, NOT actual Pod/Zeroable implementations).
- ~~**ARM reduce_add for unsigned**~~: Done. Extended reduce_add to all integer types including unsigned.
- ~~**Approximations (rcp, rsqrt) for ARM/WASM**~~: Done. ARM uses native vrecpe/vrsqrte, WASM uses division.
- ~~**mul_sub for ARM/WASM**~~: Done. ARM uses vfma with negation, WASM uses mul+sub.
- ~~**Type conversions for ARM/WASM**~~: Done. Added to_i32x4, to_i32x4_round, from_i32x4, to_f32x4, to_i32x4_low.
- ~~**shr_arithmetic for ARM/WASM**~~: Done. Added for i8x16, i16x8, i32x4.
## Suboptimal Intrinsics (needs faster-path overloads)
Track places where we use polyfills or slower instruction sequences because the base token lacks a native intrinsic, but a higher token would have one. Each entry should get a method overload that accepts the higher token for the fast path.
| Operation | Affected tokens | Current approach | Faster token | Notes | Priority |
|---|---|---|---|---|---|
| f32 cbrt initial guess | all tokens | scalar extract + bit hack | — | No SIMD cbrt exists; consider SIMD bit hack via integer ops | Low priority |
**Rules for this section:**
- Only add entries when you've verified the faster intrinsic exists and is correct.
- The overload should take the higher token as a parameter (e.g., `fn min_fast(self, other: Self, _: X64V4Token) -> Self`).
- Or use trait bounds: `fn min<T: HasX64V4>(self, other: Self, _: T) -> Self` for the fast path.
- Remove entries when the fast-path overload is implemented.
### Completed fast-path overloads
All i64/u64 min/max/abs now have `_fast` variants that take `X64V4Token`:
- `i64x2::min_fast`, `max_fast`, `abs_fast`
- `u64x2::min_fast`, `max_fast`
- `i64x4::min_fast`, `max_fast`, `abs_fast`
- `u64x4::min_fast`, `max_fast`
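A hedged usage sketch (the `_fast` signature follows the rule stated above; the existence of a plain `min` polyfill method is an assumption):
```rust
use archmage::X64V4Token;
use magetypes::simd::i64x2;

// Given already-loaded vectors, take the native AVX-512 path when a v4
// token was summoned at the entry point; otherwise use the polyfill.
fn min_i64(a: i64x2, b: i64x2, v4: Option<X64V4Token>) -> i64x2 {
    match v4 {
        Some(token) => a.min_fast(b, token), // native vpminsq
        None => a.min(b),                    // portable polyfill
    }
}
```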
## License
MIT OR Apache-2.0