archmage 0.9.1

Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.

Browse 12,000+ SIMD Intrinsics → · Docs · Magetypes · API Docs

Archmage lets you write SIMD code in Rust without unsafe. It works on x86-64, AArch64, and WASM.

[dependencies]
archmage = "0.9"

The problem

Raw SIMD in Rust requires unsafe for every intrinsic call:

use std::arch::x86_64::*;

// Every. Single. Call.
unsafe {
    let a = _mm256_loadu_ps(data.as_ptr());      // unsafe: raw pointer
    let b = _mm256_set1_ps(2.0);                  // unsafe: needs target_feature
    let c = _mm256_mul_ps(a, b);                  // unsafe: needs target_feature
    _mm256_storeu_ps(out.as_mut_ptr(), c);         // unsafe: raw pointer
}

Miss a feature check and you get undefined behavior on older CPUs. Wrap everything in unsafe and hope for the best.

The solution

use archmage::prelude::*;  // tokens, traits, macros, intrinsics, safe memory ops

// X64V3Token = AVX2 + FMA (Haswell 2013+, Zen 1+)
#[arcane]
fn multiply(_token: X64V3Token, data: &[f32; 8]) -> [f32; 8] {
    let a = _mm256_loadu_ps(data);          // safe: takes &[f32; 8], not *const f32
    let b = _mm256_set1_ps(2.0);            // safe: inside #[target_feature]
    let c = _mm256_mul_ps(a, b);            // safe: value-based (Rust 1.85+)
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(&mut out, c);          // safe: takes &mut [f32; 8]
    out
}

fn main() {
    // Runtime CPU check — returns None if AVX2+FMA unavailable
    if let Some(token) = X64V3Token::summon() {
        let result = multiply(token, &[1.0; 8]);
        println!("{:?}", result);
    }
}

No unsafe anywhere. Your crate can use #![forbid(unsafe_code)].

What changed in Rust 1.85

Rust 1.85 (Feb 2025) stabilized target_feature_11, which changed two things:

  1. #[target_feature] functions can be declared safe (Rust 2024 edition). Previously they had to be unsafe fn. Now they're just fn — but calling them from code without matching features still requires unsafe.

  2. Value-based intrinsics are safe inside #[target_feature] functions. _mm256_add_ps, _mm256_mul_ps, shuffles, compares — anything that doesn't touch a pointer is safe to call when the compiler knows the CPU features are enabled. Only pointer-based operations (_mm256_loadu_ps(ptr), _mm256_storeu_ps(ptr, v)) remain unsafe.

This means the only unsafe left in SIMD code is: (a) the one call site where you cross from normal code into a #[target_feature] function, and (b) memory operations that take raw pointers. Archmage eliminates both.
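For comparison, here is what those two residual `unsafe` sites look like without archmage, as a minimal std-only sketch on Rust 1.85+ (the function names are illustrative, not archmage API):

```rust
#[cfg(target_arch = "x86_64")]
mod simd {
    use std::arch::x86_64::*;

    // Safe to declare since Rust 1.85 -- but calling it from code that
    // hasn't proven AVX2 is available still requires `unsafe` (case a).
    #[target_feature(enable = "avx2")]
    pub fn double(data: &[f32; 8]) -> [f32; 8] {
        let mut out = [0.0f32; 8];
        unsafe {
            // (case b) pointer-based loads/stores stay unsafe even here...
            let v = _mm256_loadu_ps(data.as_ptr());
            let d = _mm256_add_ps(v, v); // ...while value-based ops are safe
            _mm256_storeu_ps(out.as_mut_ptr(), d);
        }
        out
    }
}

pub fn double(data: &[f32; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // (case a) the boundary crossing that #[arcane] generates for you
            return unsafe { simd::double(data) };
        }
    }
    data.map(|x| x * 2.0)
}
```

Archmage's token + `#[arcane]` handles case (a), and the safe memory wrappers handle case (b).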

How archmage makes it zero-unsafe

┌─────────── Your crate: #![forbid(unsafe_code)] ───────────┐
│                                                            │
│   summon() ──→ Token ──→ #[arcane] fn ──→ #[rite] fn      │
│    (CPUID)    (proof)   (entry point)   (inlines freely)  │
│                              │                             │
│                   prelude / import_intrinsics               │
│                    ┌────────┴────────┐                     │
│                    │ core::arch    safe_unaligned_simd     │
│                    │ (value ops    (memory ops             │
│                    │  = safe)       = safe)                │
│                    └─────────────────┘                     │
│                                                            │
│   Result: all intrinsics are safe. No unsafe in your code. │
└────────────────────────────────────────────────────────────┘

Three pieces eliminate the remaining unsafe:

  1. Tokens prove the CPU has the right features. X64V3Token::summon() checks CPUID and returns Some(token) only if AVX2+FMA are available. The token is zero-sized — passing it around costs nothing at runtime. Detection is cached (~1.3 ns), or compiles away entirely with -Ctarget-cpu=haswell.

  2. #[arcane] / #[rite] read the token type from your function signature and generate #[target_feature(enable = "avx2,fma,...")]. The #[arcane] macro generates the unsafe call to cross the #[target_feature] boundary inside its own generated code — your crate never writes unsafe. #[rite] functions called from matching #[target_feature] contexts don't need unsafe at all (Rust 1.85+). Both macros handle #[cfg(target_arch)] gating automatically.

  3. Safe memory ops from safe_unaligned_simd (by okaneco) shadow core::arch's pointer-based versions. _mm256_loadu_ps takes &[f32; 8] instead of *const f32. Same function names, safe signatures. These are included in the prelude, or you can use #[arcane(import_intrinsics)] to scope them to a single function.

All tokens compile on all platforms. On the wrong architecture, summon() returns None. You rarely need #[cfg(target_arch)] in your code.
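The token pattern itself can be sketched in a few lines of plain Rust (illustrative only, not archmage's actual implementation): a zero-sized type whose only constructor runs the CPU check, so holding a value is proof the check passed.

```rust
// Sketch of a capability token (hypothetical type, not archmage's code).
#[derive(Clone, Copy)]
pub struct Avx2Token(()); // zero-sized; private field blocks outside construction

impl Avx2Token {
    /// Returns Some only if the running CPU supports AVX2.
    /// Compiles on every architecture; elsewhere it just returns None.
    pub fn summon() -> Option<Self> {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                return Some(Avx2Token(()));
            }
        }
        None
    }
}
```

Because the type is zero-sized, threading it through a call tree costs nothing at runtime; it exists purely at the type level.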

Import styles

use archmage::prelude::* imports everything — tokens, traits, macros, all platform intrinsics, and safe memory ops. The examples in this README use the prelude for brevity:

use archmage::prelude::*;

#[arcane]  // intrinsics already in scope from prelude
fn example(_: X64V3Token, data: &[f32; 8]) -> __m256 {
    _mm256_loadu_ps(data)
}

If you don't want thousands of intrinsic names at module scope, use selective imports with import_intrinsics to scope them to the function body:

use archmage::{X64V3Token, SimdToken, arcane};

#[arcane(import_intrinsics)]  // injects intrinsics inside this function only
fn example(_: X64V3Token, data: &[f32; 8]) -> core::arch::x86_64::__m256 {
    _mm256_loadu_ps(data)  // in scope from import_intrinsics
}

Both import the same combined intrinsics module — using both is just duplication. The prelude is simpler; import_intrinsics is more explicit.

#[arcane] vs #[rite]: entry point vs internal

#[rite] should be your default. Use #[arcane] only at the entry point — the first call from non-SIMD code.

use archmage::prelude::*;

// Entry point: #[arcane] — safe wrapper for non-SIMD callers
#[arcane]
fn dot_product(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let products = mul_vectors(token, a, b);
    horizontal_sum(token, products)
}

// Called from SIMD code: #[rite] — inlines into caller, no boundary
#[rite]
fn mul_vectors(_: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> __m256 {
    _mm256_mul_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b))
}

#[rite]
fn horizontal_sum(_: X64V3Token, v: __m256) -> f32 {
    let sum = _mm256_hadd_ps(v, v);
    let sum = _mm256_hadd_ps(sum, sum);
    let low = _mm256_castps256_ps128(sum);
    let high = _mm256_extractf128_ps::<1>(sum);
    _mm_cvtss_f32(_mm_add_ss(low, high))
}

Both macros read the token type to decide which #[target_feature] to emit: X64V3Token → avx2,fma,...; X64V4Token → avx512f,avx512bw,.... The token type is the feature selector.

Why two macros? #[arcane] generates a safe wrapper that crosses the #[target_feature] boundary — LLVM can't optimize across it. #[rite] adds #[target_feature] + #[inline] directly, so LLVM inlines it into the caller. Same token type = same features = no boundary.

Processing 1000 8-float vector additions (full benchmark details):

Pattern                               Time          Why
#[rite] inside #[arcane]              547 ns        Features match — LLVM inlines
#[arcane] per iteration               2209 ns (4x)  Target-feature boundary per call
Bare #[target_feature] (no archmage)  2222 ns (4x)  Same boundary — archmage adds nothing

The 4x penalty is LLVM's #[target_feature] boundary, not archmage overhead. Bare #[target_feature] without archmage has the same cost. With real workloads (DCT-8), the boundary costs up to 6.2x.

The rule: #[arcane] once at the entry point, #[rite] for everything called from SIMD code. Pass the same token type through your call tree so features stay consistent.

For trait impls, use #[arcane(_self = Type)] — a nested inner-function approach (a sibling function would add methods that aren't part of the trait definition).

Auto-vectorization with #[autoversion]

Don't want to write intrinsics? Write plain scalar code and let the compiler vectorize it:

use archmage::prelude::*;

#[autoversion]
fn sum_of_squares(_token: SimdToken, data: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for &x in data {
        sum += x * x;
    }
    sum
}

// Call directly — no token needed, no unsafe:
let result = sum_of_squares(&my_data);

#[autoversion] generates a separate copy of your function for each architecture tier — each compiled with #[target_feature] to unlock the auto-vectorizer — plus a runtime dispatcher that picks the best one. On x86-64 with AVX2+FMA, that loop compiles to vfmadd231ps (8 floats per cycle). On ARM, you get fmla. The _scalar fallback compiles without SIMD features as a safety net.

The _token: SimdToken parameter is a placeholder — you don't use it in the body. The macro replaces it with concrete token types (X64V3Token, NeonToken, etc.) for each variant.

What gets generated (default tiers):

  • sum_of_squares_v4(token: X64V4Token, ...) — AVX-512 (with avx512 feature)
  • sum_of_squares_v3(token: X64V3Token, ...) — AVX2+FMA
  • sum_of_squares_neon(token: NeonToken, ...) — AArch64 NEON
  • sum_of_squares_wasm128(token: Wasm128Token, ...) — WASM SIMD
  • sum_of_squares_scalar(token: ScalarToken, ...) — no SIMD
  • sum_of_squares(data: &[f32]) -> f32 — dispatcher (token param removed)
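The dispatch shape can be sketched by hand in plain Rust (illustrative only; the real macro output differs, and the tier set here is simplified to v3 + scalar):

```rust
// Hand-written sketch of what an #[autoversion] expansion roughly does.
fn sum_of_squares(data: &[f32]) -> f32 {
    // One copy per tier, compiled with that tier's features enabled
    // so LLVM's auto-vectorizer can use them.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2,fma")]
    fn sum_v3(data: &[f32]) -> f32 {
        data.iter().map(|x| x * x).sum()
    }

    // Scalar fallback: identical body, no extra features.
    fn sum_scalar(data: &[f32]) -> f32 {
        data.iter().map(|x| x * x).sum()
    }

    // Runtime dispatch. Hand-written, this needs one `unsafe` to cross
    // the #[target_feature] boundary; archmage's macros generate it.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            return unsafe { sum_v3(data) };
        }
    }
    sum_scalar(data)
}
```

The macro adds the remaining tiers (NEON, WASM) and the cached detection; the structure is the same.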

Explicit tiers: #[autoversion(v3, neon)]. scalar is always implicit.

For inherent methods, self works naturally — no special parameters needed. For trait method delegation, use #[autoversion(_self = MyType)] and _self in the body. See the full parameter reference or the API docs.

When to use which:

                 #[autoversion]            #[arcane] + #[rite]
You write        Scalar loops              SIMD intrinsics
Vectorization    Compiler auto-vectorizes  You choose the instructions
Lines of code    1 attribute               Manual variant + dispatch
Best for         Simple numeric loops      Hand-tuned SIMD kernels

SIMD types with magetypes

magetypes provides ergonomic SIMD vector types (f32x8, i32x4, etc.) with natural Rust operators. It's an exploratory companion crate — the API may change between releases.

[dependencies]
archmage = "0.9"
magetypes = "0.9"

use archmage::prelude::*;
use magetypes::simd::f32x8;

fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    if let Some(token) = X64V3Token::summon() {
        dot_product_simd(token, a, b)
    } else {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }
}

#[arcane]
fn dot_product_simd(token: X64V3Token, a: &[f32], b: &[f32]) -> f32 {
    let mut sum = f32x8::zero(token);
    // Note: any tail shorter than 8 elements is ignored here for brevity.
    for (a_chunk, b_chunk) in a.chunks_exact(8).zip(b.chunks_exact(8)) {
        let va = f32x8::load(token, a_chunk.try_into().unwrap());
        let vb = f32x8::load(token, b_chunk.try_into().unwrap());
        sum = va.mul_add(vb, sum);
    }
    sum.reduce_add()
}

f32x8 wraps __m256 on x86 with AVX2. On ARM/WASM, it's polyfilled with two f32x4 operations — same API, automatic fallback. The #[arcane] wrapper lets LLVM optimize the entire loop as a single SIMD region.

Runtime dispatch with incant! (alias: dispatch_variant!)

Write platform-specific variants with concrete types, then dispatch at runtime:

use archmage::incant;
#[cfg(target_arch = "x86_64")]
use magetypes::simd::f32x8;

#[cfg(target_arch = "x86_64")]
fn sum_squares_v3(token: archmage::X64V3Token, data: &[f32]) -> f32 {
    let mut chunks = data.chunks_exact(8);
    let mut acc = f32x8::zero(token);
    for chunk in chunks.by_ref() {
        let v = f32x8::from_array(token, chunk.try_into().unwrap());
        acc = v.mul_add(v, acc);
    }
    acc.reduce_add() + chunks.remainder().iter().map(|x| x * x).sum::<f32>()
}

fn sum_squares_scalar(_token: archmage::ScalarToken, data: &[f32]) -> f32 {
    data.iter().map(|x| x * x).sum()
}

/// Dispatches to the best available at runtime.
fn sum_squares(data: &[f32]) -> f32 {
    incant!(sum_squares(data), [v3])
}

Each variant's first parameter is the matching token type — _v3 takes X64V3Token, _neon takes NeonToken, etc. A _scalar variant (taking ScalarToken) is always required as the fallback.

incant! wraps each tier's call in #[cfg(target_arch)] and #[cfg(feature)] guards, so you only define variants for architectures you target. With no explicit tier list, incant! dispatches to v3, neon, wasm128, and scalar by default (plus v4 if the avx512 feature is enabled).

Known tiers: v1, v2, x64_crypto, v3, v3_crypto, v4, v4x, arm_v2, arm_v3, neon, neon_aes, neon_sha3, neon_crc, wasm128, scalar.

If you already have a token, use with to dispatch on its concrete type: incant!(func(data) with token).

Tokens

Token             Alias      Features                               Hardware
X64V1Token        Sse2Token  SSE, SSE2                              x86_64 baseline (always available)
X64V2Token                   + SSE4.2, POPCNT                       Nehalem 2008+
X64CryptoToken               V2 + PCLMULQDQ, AES-NI                 Westmere 2010+
X64V3Token                   + AVX2, FMA, BMI2                      Haswell 2013+, Zen 1+
X64V3CryptoToken             V3 + VPCLMULQDQ, VAES                  Zen 3+ 2020, Alder Lake 2021+
X64V4Token        Server64   + AVX-512 (requires avx512 feature)    Skylake-X 2017+, Zen 4+
NeonToken         Arm64      NEON                                   All 64-bit ARM
Arm64V2Token                 + CRC, RDM, DotProd, FP16, AES, SHA2   A55+, M1+, Graviton 2+
Arm64V3Token                 + FHM, FCMA, SHA3, I8MM, BF16          A510+, M2+, Snapdragon X
Wasm128Token                 WASM SIMD                              Compile-time only
ScalarToken                  (none)                                 Always available

Higher tokens subsume lower ones: X64V4Token → X64V3Token → X64V2Token → X64V1Token. Downcasting is zero-cost. #[arcane(stub)] generates unreachable stubs on non-matching architectures when you need cross-arch dispatch without #[cfg] guards. incant! handles cfg-gating automatically.

See token-registry.toml for the complete mapping of tokens to CPU features.

Testing SIMD dispatch paths

for_each_token_permutation tests every incant! dispatch path on your native hardware — no cross-compilation needed. It disables tokens one at a time, running your closure at each tier from "all SIMD enabled" down to "scalar only":

use archmage::testing::{for_each_token_permutation, CompileTimePolicy};

#[test]
fn sum_squares_matches_across_tiers() {
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let expected: f32 = data.iter().map(|x| x * x).sum();

    let report = for_each_token_permutation(CompileTimePolicy::Warn, |perm| {
        let result = sum_squares(&data);
        assert!(
            (result - expected).abs() < 1e-1,
            "mismatch at tier: {perm}"
        );
    });

    assert!(report.permutations_run >= 2, "expected multiple tiers");
}

On an AVX-512 machine this runs 5–7 permutations; on Haswell, 3. Tokens the CPU doesn't have are skipped.

If you compiled with -Ctarget-cpu=native, the compiler bakes feature detection into the binary and tokens can't be disabled at runtime. Use the testable_dispatch feature to force runtime detection in CI:

[dev-dependencies]
archmage = { version = "0.9", features = ["testable_dispatch"] }

For manual single-token testing, lock_token_testing() serializes against parallel tests. See the testing docs for CompileTimePolicy::Fail, env-var integration, and dangerously_disable_token_process_wide.

Feature flags

Feature            Default  Description
std                yes      Standard library (required for runtime detection)
macros             yes      #[arcane], #[rite], #[autoversion], #[magetypes], incant!
avx512             no       AVX-512 tokens (X64V4Token, X64V4xToken, Avx512Fp16Token)
testable_dispatch  no       Makes token disabling work with -Ctarget-cpu=native
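For example, to build with the AVX-512 tokens enabled (version number as used elsewhere in this README):

```toml
[dependencies]
archmage = { version = "0.9", features = ["avx512"] }
```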

Acknowledgments

  • safe_unaligned_simd by okaneco — Reference-based wrappers for every SIMD load/store intrinsic across x86, ARM, and WASM. This crate closed the last unsafe gap: _mm256_loadu_ps taking *const f32 was the one thing you couldn't make safe without a wrapper. Archmage depends on it and re-exports its functions through import_intrinsics, shadowing core::arch's pointer-based versions automatically.

License

MIT OR Apache-2.0