# archmage
Browse 12,000+ SIMD Intrinsics → · Docs · Magetypes · API Docs
Archmage lets you write SIMD code in Rust without unsafe. It works on x86-64, AArch64, and WASM.
## The problem
Raw SIMD in Rust requires unsafe for every intrinsic call:
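On x86-64 with plain `std::arch`, that looks roughly like this:

```rust
use std::arch::x86_64::*;

fn add8(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    // Miss this check and the intrinsics below are UB on older CPUs.
    assert!(is_x86_feature_detected!("avx2"));
    // Every. Single. Call.
    let va = unsafe { _mm256_loadu_ps(a.as_ptr()) };
    let vb = unsafe { _mm256_loadu_ps(b.as_ptr()) };
    let sum = unsafe { _mm256_add_ps(va, vb) };
    let mut out = [0.0_f32; 8];
    unsafe { _mm256_storeu_ps(out.as_mut_ptr(), sum) };
    out
}
```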
Miss a feature check and you get undefined behavior on older CPUs. Wrap everything in unsafe and hope for the best.
## The solution
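With archmage, the same kernel looks roughly like this (a sketch — the safe load/store wrappers come from `safe_unaligned_simd` and take references instead of pointers, so exact signatures may differ slightly):

```rust
use archmage::prelude::*; // tokens, traits, macros, intrinsics, safe memory ops

// X64V3Token = AVX2 + FMA (Haswell 2013+, Zen 1+)
#[arcane]
fn add8(_token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let va = _mm256_loadu_ps(a);      // safe: takes &[f32; 8], not *const f32
    let vb = _mm256_loadu_ps(b);
    let sum = _mm256_add_ps(va, vb);  // safe: value-based intrinsic
    let mut out = [0.0_f32; 8];
    _mm256_storeu_ps(&mut out, sum);  // safe: takes &mut [f32; 8]
    out
}

fn main() {
    let (a, b) = ([1.0_f32; 8], [2.0_f32; 8]);
    if let Some(token) = X64V3Token::summon() {
        let _ = add8(token, &a, &b);  // no unsafe anywhere
    }
}
```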
No unsafe anywhere. Your crate can use #![forbid(unsafe_code)].
## What changed in Rust 1.85
Rust 1.85 (Feb 2025) stabilized target_feature_11, which changed two things:
- `#[target_feature]` functions can be declared safe (Rust 2024 edition). Previously they had to be `unsafe fn`. Now they're just `fn` — but calling them from code without matching features still requires `unsafe`.
- Value-based intrinsics are safe inside `#[target_feature]` functions. `_mm256_add_ps`, `_mm256_mul_ps`, shuffles, compares — anything that doesn't touch a pointer is safe to call when the compiler knows the CPU features are enabled. Only pointer-based operations (`_mm256_loadu_ps(ptr)`, `_mm256_storeu_ps(ptr, v)`) remain unsafe.
This means the only unsafe left in SIMD code is: (a) the one call site where you cross from normal code into a #[target_feature] function, and (b) memory operations that take raw pointers. Archmage eliminates both.
## How archmage makes it zero-unsafe
```text
┌─────────── Your crate: #![forbid(unsafe_code)] ───────────┐
│                                                           │
│  summon() ──→ Token ──→ #[arcane] fn ──→ #[rite] fn       │
│  (CPUID)     (proof)    (entry point)    (inlines freely) │
│                             │                             │
│                prelude / import_intrinsics                │
│                   ┌─────────┴─────────┐                   │
│             core::arch      safe_unaligned_simd           │
│             (value ops        (memory ops                 │
│              = safe)           = safe)                    │
│                                                           │
│  Result: all intrinsics are safe. No unsafe in your code. │
└────────────────────────────────────────────────────────────┘
```
Three pieces eliminate the remaining unsafe:
- Tokens prove the CPU has the right features. `X64V3Token::summon()` checks CPUID and returns `Some(token)` only if AVX2+FMA are available. The token is zero-sized — passing it around costs nothing at runtime. Detection is cached (~1.3 ns), or compiles away entirely with `-Ctarget-cpu=haswell`.
- `#[arcane]` / `#[rite]` read the token type from your function signature and generate `#[target_feature(enable = "avx2,fma,...")]`. The `#[arcane]` macro generates the `unsafe` call to cross the `#[target_feature]` boundary inside its own generated code — your crate never writes `unsafe`. `#[rite]` functions called from matching `#[target_feature]` contexts don't need `unsafe` at all (Rust 1.85+). Both macros handle `#[cfg(target_arch)]` gating automatically.
- Safe memory ops from `safe_unaligned_simd` (by okaneco) shadow `core::arch`'s pointer-based versions. `_mm256_loadu_ps` takes `&[f32; 8]` instead of `*const f32`. Same function names, safe signatures. These are included in the prelude, or you can use `#[arcane(import_intrinsics)]` to scope them to a single function.
All tokens compile on all platforms. On the wrong architecture, summon() returns None. You rarely need #[cfg(target_arch)] in your code.
## Import styles
`use archmage::prelude::*` imports everything — tokens, traits, macros, all platform intrinsics, and safe memory ops. The examples in this README use the prelude for brevity:
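For example, a small kernel sketch:

```rust
use archmage::prelude::*;

#[arcane]
fn double8(_token: X64V3Token, x: &[f32; 8]) -> [f32; 8] {
    // intrinsics already in scope from prelude
    let v = _mm256_loadu_ps(x);
    let mut out = [0.0_f32; 8];
    _mm256_storeu_ps(&mut out, _mm256_add_ps(v, v));
    out
}
```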
If you don't want thousands of intrinsic names at module scope, use selective imports with import_intrinsics to scope them to the function body:
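A sketch of the same kernel with scoped imports (the exact item paths in the `use` line are assumptions):

```rust
use archmage::{arcane, X64V3Token};

#[arcane(import_intrinsics)]
fn double8(_token: X64V3Token, x: &[f32; 8]) -> [f32; 8] {
    // injects intrinsics inside this function only
    let v = _mm256_loadu_ps(x);
    let mut out = [0.0_f32; 8];
    _mm256_storeu_ps(&mut out, _mm256_add_ps(v, v));
    out
}
```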
Both import the same combined intrinsics module — using both is just duplication. The prelude is simpler; import_intrinsics is more explicit.
## `#[arcane]` vs `#[rite]`: entry point vs internal
#[rite] should be your default. Use #[arcane] only at the entry point — the first call from non-SIMD code.
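A sketch of the pattern:

```rust
use archmage::prelude::*;

// Called from SIMD code: #[rite] — inlines into caller, no boundary
#[rite]
fn add(_token: X64V3Token, a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b)
}

// Entry point: #[arcane] — safe wrapper for non-SIMD callers
#[arcane]
fn add8(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let sum = add(token, _mm256_loadu_ps(a), _mm256_loadu_ps(b));
    let mut out = [0.0_f32; 8];
    _mm256_storeu_ps(&mut out, sum);
    out
}
```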
Both macros read the token type to decide which #[target_feature] to emit. X64V3Token → avx2,fma,.... X64V4Token → avx512f,avx512bw,.... The token type is the feature selector.
Why two macros? #[arcane] generates a safe wrapper that crosses the #[target_feature] boundary — LLVM can't optimize across it. #[rite] adds #[target_feature] + #[inline] directly, so LLVM inlines it into the caller. Same token type = same features = no boundary.
Processing 1000 8-float vector additions (full benchmark details):
| Pattern | Time | Why |
|---|---|---|
| `#[rite]` inside `#[arcane]` | 547 ns | Features match — LLVM inlines |
| `#[arcane]` per iteration | 2209 ns (4x) | Target-feature boundary per call |
| Bare `#[target_feature]` (no archmage) | 2222 ns (4x) | Same boundary — archmage adds nothing |
The 4x penalty is LLVM's #[target_feature] boundary, not archmage overhead. Bare #[target_feature] without archmage has the same cost. With real workloads (DCT-8), the boundary costs up to 6.2x.
The rule: #[arcane] once at the entry point, #[rite] for everything called from SIMD code. Pass the same token type through your call tree so features stay consistent.
For trait impls, use `#[arcane(_self = Type)]` — a nested inner-function approach (a sibling function would add methods that aren't in the trait definition).
## Auto-vectorization with `#[autoversion]`
Don't want to write intrinsics? Write plain scalar code and let the compiler vectorize it:
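A sketch of the pattern:

```rust
use archmage::prelude::*;

#[autoversion]
fn sum_of_squares(_token: SimdToken, data: &[f32]) -> f32 {
    // Plain scalar code; each generated variant is compiled with
    // #[target_feature] so LLVM's auto-vectorizer can use it.
    data.iter().map(|x| x * x).sum()
}

fn main() {
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    // Call directly — no token needed, no unsafe:
    let result = sum_of_squares(&data);
    println!("{result}");
}
```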
#[autoversion] generates a separate copy of your function for each architecture tier — each compiled with #[target_feature] to unlock the auto-vectorizer — plus a runtime dispatcher that picks the best one. On x86-64 with AVX2+FMA, that loop compiles to vfmadd231ps (8 floats per cycle). On ARM, you get fmla. The _scalar fallback compiles without SIMD features as a safety net.
The _token: SimdToken parameter is a placeholder — you don't use it in the body. The macro replaces it with concrete token types (X64V3Token, NeonToken, etc.) for each variant.
What gets generated (default tiers):
- `sum_of_squares_v4(token: X64V4Token, ...)` — AVX-512 (with `avx512` feature)
- `sum_of_squares_v3(token: X64V3Token, ...)` — AVX2+FMA
- `sum_of_squares_neon(token: NeonToken, ...)` — AArch64 NEON
- `sum_of_squares_wasm128(token: Wasm128Token, ...)` — WASM SIMD
- `sum_of_squares_scalar(token: ScalarToken, ...)` — no SIMD
- `sum_of_squares(data: &[f32]) -> f32` — dispatcher (token param removed)
Explicit tiers: #[autoversion(v3, neon)]. scalar is always implicit.
For inherent methods, self works naturally — no special parameters needed. For trait method delegation, use #[autoversion(_self = MyType)] and _self in the body. See the full parameter reference or the API docs.
When to use which:
| | `#[autoversion]` | `#[arcane]` + `#[rite]` |
|---|---|---|
| You write | Scalar loops | SIMD intrinsics |
| Vectorization | Compiler auto-vectorizes | You choose the instructions |
| Lines of code | 1 attribute | Manual variant + dispatch |
| Best for | Simple numeric loops | Hand-tuned SIMD kernels |
## SIMD types with magetypes
magetypes provides ergonomic SIMD vector types (f32x8, i32x4, etc.) with natural Rust operators. It's an exploratory companion crate — the API may change between releases.
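A sketch of usage (assuming `f32x8` is `Copy` and implements the arithmetic operator traits, per the description below):

```rust
use archmage::prelude::*;
use magetypes::f32x8;

#[arcane]
fn square_add(_token: X64V3Token, xs: &mut [f32x8], ys: &[f32x8]) {
    for (x, y) in xs.iter_mut().zip(ys) {
        // Natural Rust operators; on x86-64 this lowers to AVX2/FMA.
        *x = *x * *x + *y;
    }
}
```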
f32x8 wraps __m256 on x86 with AVX2. On ARM/WASM, it's polyfilled with two f32x4 operations — same API, automatic fallback. The #[arcane] wrapper lets LLVM optimize the entire loop as a single SIMD region.
## Runtime dispatch with `incant!` (alias: `dispatch_variant!`)
Write platform-specific variants with concrete types, then dispatch at runtime:
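A sketch of the shape — the exact `incant!` invocation form shown here is an assumption; see the API docs for the precise syntax:

```rust
use archmage::{incant, prelude::*};
use magetypes::f32x8;

#[arcane]
fn square_all_v3(_token: X64V3Token, data: &mut [f32x8]) {
    for v in data.iter_mut() { *v = *v * *v; } // AVX2-backed
}

#[arcane]
fn square_all_neon(_token: NeonToken, data: &mut [f32x8]) {
    for v in data.iter_mut() { *v = *v * *v; } // NEON-backed
}

fn square_all_scalar(_token: ScalarToken, data: &mut [f32x8]) {
    for v in data.iter_mut() { *v = *v * *v; } // required fallback
}

/// Dispatches to the best available variant at runtime.
fn square_all(data: &mut [f32x8]) {
    incant!(square_all(data))
}
```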
Each variant's first parameter is the matching token type — _v3 takes X64V3Token, _neon takes NeonToken, etc. A _scalar variant (taking ScalarToken) is always required as the fallback.
incant! wraps each tier's call in #[cfg(target_arch)] and #[cfg(feature)] guards, so you only define variants for architectures you target. With no explicit tier list, incant! dispatches to v3, neon, wasm128, and scalar by default (plus v4 if the avx512 feature is enabled).
Known tiers: v1, v2, x64_crypto, v3, v3_crypto, v4, v4x, arm_v2, arm_v3, neon, neon_aes, neon_sha3, neon_crc, wasm128, scalar.
If you already have a token, use with to dispatch on its concrete type: incant!(func(data) with token).
## Tokens
| Token | Alias | Features | Hardware |
|---|---|---|---|
| `X64V1Token` | `Sse2Token` | SSE, SSE2 | x86_64 baseline (always available) |
| `X64V2Token` | | + SSE4.2, POPCNT | Nehalem 2008+ |
| `X64CryptoToken` | | V2 + PCLMULQDQ, AES-NI | Westmere 2010+ |
| `X64V3Token` | — | + AVX2, FMA, BMI2 | Haswell 2013+, Zen 1+ |
| `X64V3CryptoToken` | | V3 + VPCLMULQDQ, VAES | Zen 3+ 2020, Alder Lake 2021+ |
| `X64V4Token` | `Server64` | + AVX-512 (requires `avx512` feature) | Skylake-X 2017+, Zen 4+ |
| `NeonToken` | `Arm64` | NEON | All 64-bit ARM |
| `Arm64V2Token` | | + CRC, RDM, DotProd, FP16, AES, SHA2 | A55+, M1+, Graviton 2+ |
| `Arm64V3Token` | | + FHM, FCMA, SHA3, I8MM, BF16 | A510+, M2+, Snapdragon X |
| `Wasm128Token` | | WASM SIMD | Compile-time only |
| `ScalarToken` | | (none) | Always available |
Higher tokens subsume lower ones: X64V4Token → X64V3Token → X64V2Token → X64V1Token. Downcasting is free (zero-cost). #[arcane(stub)] generates unreachable stubs on non-matching architectures when you need cross-arch dispatch without #[cfg] guards. incant! handles cfg-gating automatically.
See token-registry.toml for the complete mapping of tokens to CPU features.
## Testing SIMD dispatch paths
for_each_token_permutation tests every incant! dispatch path on your native hardware — no cross-compilation needed. It disables tokens one at a time, running your closure at each tier from "all SIMD enabled" down to "scalar only":
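A sketch, reusing the `sum_of_squares` example from the `#[autoversion]` section above (it assumes `for_each_token_permutation` is exported via the prelude and takes a no-argument closure):

```rust
use archmage::prelude::*;

#[test]
fn all_dispatch_paths_agree() {
    let data: Vec<f32> = (0..256).map(|i| i as f32).collect();
    let expected: f32 = data.iter().map(|x| x * x).sum();

    // Runs the closure once per permutation, from "all SIMD enabled"
    // down to "scalar only"; tiers the CPU lacks are skipped.
    for_each_token_permutation(|| {
        let got = sum_of_squares(&data);
        // Allow for float reassociation differences between tiers.
        assert!((got - expected).abs() <= expected * 1e-4);
    });
}
```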
On an AVX-512 machine this runs 5–7 permutations; on Haswell, 3. Tokens the CPU doesn't have are skipped.
If you compiled with -Ctarget-cpu=native, the compiler bakes feature detection into the binary and tokens can't be disabled at runtime. Use the testable_dispatch feature to force runtime detection in CI:
For manual single-token testing, lock_token_testing() serializes against parallel tests. See the testing docs for CompileTimePolicy::Fail, env-var integration, and dangerously_disable_token_process_wide.
## Feature flags
| Feature | Default | Description |
|---|---|---|
| `std` | yes | Standard library (required for runtime detection) |
| `macros` | yes | `#[arcane]`, `#[rite]`, `#[autoversion]`, `#[magetypes]`, `incant!` |
| `avx512` | no | AVX-512 tokens (`X64V4Token`, `X64V4xToken`, `Avx512Fp16Token`) |
| `testable_dispatch` | no | Makes token disabling work with `-Ctarget-cpu=native` |
## Acknowledgments
`safe_unaligned_simd` by okaneco — Reference-based wrappers for every SIMD load/store intrinsic across x86, ARM, and WASM. This crate closed the last `unsafe` gap: `_mm256_loadu_ps` taking `*const f32` was the one thing you couldn't make safe without a wrapper. Archmage depends on it and re-exports its functions through `import_intrinsics`, shadowing `core::arch`'s pointer-based versions automatically.
## License
MIT OR Apache-2.0