archmage
Browse 12,000+ SIMD Intrinsics → · Docs · Magetypes · API Docs
Archmage lets you write SIMD code in Rust without unsafe. It works on x86-64, AArch64, and WASM.
```toml
[dependencies]
archmage = "0.9"
```
The problem
Raw SIMD in Rust requires unsafe for every intrinsic call:
```rust
use core::arch::x86_64::*;

// Illustrative — any hand-written intrinsic code looks like this.
fn square(data: &[f32; 8], out: &mut [f32; 8]) {
    // Every. Single. Call.
    let v = unsafe { _mm256_loadu_ps(data.as_ptr()) };
    let sq = unsafe { _mm256_mul_ps(v, v) };
    unsafe { _mm256_storeu_ps(out.as_mut_ptr(), sq) };
}
```
Miss a feature check and you get undefined behavior on older CPUs. Wrap everything in unsafe and hope for the best.
The solution
```rust
use archmage::prelude::*; // tokens, traits, macros, intrinsics, safe memory ops

// X64V3Token = AVX2 + FMA (Haswell 2013+, Zen 1+)
// (illustrative sketch of the API described below)
#[arcane]
fn square(token: X64V3Token, data: &[f32; 8], out: &mut [f32; 8]) {
    let v = _mm256_loadu_ps(data); // safe: takes &[f32; 8], not *const f32
    _mm256_storeu_ps(out, _mm256_mul_ps(v, v)); // value ops: safe
}

if let Some(token) = X64V3Token::summon() {
    square(token, &input, &mut output);
}
```
No unsafe anywhere. Your crate can use #![forbid(unsafe_code)].
How Rust enforces SIMD safety
Rust 1.85 (Feb 2025) changed the rules for #[target_feature] functions:
Rust's #[target_feature] call rules (1.85+, 2024 edition)
┌─────────────────────────┐ ┌──────────────────────────────┐
│ fn normal_code() │ unsafe │ #[target_feature(avx2, fma)] │
│ │────────▶│ fn simd_work() │
│ (no target features) │ │ │
└─────────────────────────┘ └──────────────┬───────────────┘
│
Calling simd_work() from │ safe
normal_code() requires │ (subset of
unsafe { }. The caller ▼ caller's features)
has fewer features. ┌──────────────────────────────┐
│ #[target_feature(avx2)] │
│ fn simd_helper() │
│ │
└──────────────────────────────┘
Caller has same or superset features? Safe call. No unsafe needed.
Caller has fewer features? Rust requires an unsafe block.
Inside a #[target_feature] function, value-based intrinsics are safe — _mm256_add_ps, shuffles, compares, anything that doesn't touch a pointer. Only pointer-based memory ops (_mm256_loadu_ps(ptr)) remain unsafe.
Two gaps remain:

1. **The boundary crossing.** The first call from normal code into a `#[target_feature]` function requires `unsafe`. Someone has to verify the CPU actually has those features.
2. **Memory operations.** `_mm256_loadu_ps` takes `*const f32`. Raw pointers need `unsafe`.
How archmage closes both gaps
Archmage makes the boundary crossing sound by tying it to runtime CPU detection. You can't cross without proof.
Your crate: #![forbid(unsafe_code)]
┌────────────────────────────────────────────────────────────┐
│ │
│ 1. summon() Checks CPUID. Returns Some(token) │
│ X64V3Token only if AVX2+FMA are present. │
│ Token is zero-sized proof. │
│ │ │
│ ▼ │
│ 2. #[arcane] fn Reads token type from signature. │
│ entry(token, ..) Generates #[target_feature] sibling. │
│ Wraps the unsafe call internally — │
│ │ your code never writes unsafe. │
│ ▼ │
│ 3. #[rite] fns Same #[target_feature] as caller. │
│ + plain fns Safe calls — Rust allows it. │
│ Inline freely, no boundary. │
│ │ │
│ ▼ │
│ 4. Intrinsics Value ops: safe (Rust 1.85+) │
│ in scope Memory ops: safe_unaligned_simd │
│ takes &[f32; 8], not *const f32 │
│ │
│ Result: all intrinsics are safe. No unsafe in your code. │
└────────────────────────────────────────────────────────────┘
Tokens are grouped by common CPU tiers. X64V3Token covers AVX2+FMA+BMI2 — the set that Haswell (2013) and Zen 1+ share. NeonToken covers AArch64 NEON. Arm64V2Token covers CRC+RDM+DotProd+FP16+AES+SHA2 — the set that Apple M1, Cortex-A55+, and Graviton 2+ share. You pick a tier, not individual features. summon() checks all features in the tier atomically; it either succeeds (every feature present) or returns None. The token is zero-sized — passing it costs nothing. Detection is cached (~1.3 ns), or compiles away entirely with -Ctarget-cpu=haswell.
#[arcane] is the trampoline. It generates a sibling function with #[target_feature(enable = "avx2,fma,...")] and an #[inline(always)] wrapper that calls it through unsafe. The macro generates the unsafe block, not you. Since the token's existence proves the features are present, the call is sound. From inside the #[arcane] function, you can use intrinsics directly (value ops are safe) and call #[rite] functions with matching features (safe under Rust 1.85+). Both macros handle #[cfg(target_arch)] gating automatically.
safe_unaligned_simd (by okaneco) closes the memory gap. It shadows core::arch's pointer-based load/store functions with reference-based versions — _mm256_loadu_ps takes &[f32; 8] instead of *const f32. Same names, safe signatures. Archmage re-exports these through import_intrinsics, so the safe versions are in scope automatically.
All tokens compile on all platforms. On the wrong architecture, summon() returns None. You rarely need #[cfg(target_arch)] in your code.
Why #![forbid(unsafe_code)] matters for AI-written SIMD
AI is a patient compiler. It can write SIMD intrinsics, run benchmarks, iterate on hot loops, and try instruction sequences that no human would bother testing. Constraining AI to #![forbid(unsafe_code)] means it can't introduce undefined behavior — the type system catches unsound calls at compile time. The result is hand-tuned SIMD that's both fast and provably safe.
Import styles
use archmage::prelude::* imports everything — tokens, traits, macros, all platform intrinsics, and safe memory ops. The examples in this README use the prelude for brevity:
```rust
use archmage::prelude::*;
// intrinsics already in scope from prelude
```
If you don't want thousands of intrinsic names at module scope, use selective imports with import_intrinsics to scope them to the function body:
```rust
use archmage::import_intrinsics;
// injects intrinsics inside this function only
```
Both import the same combined intrinsics module — using both is just duplication. The prelude is simpler; import_intrinsics is more explicit.
Which macro do I use?
Writing a SIMD function?
┌───────────┴───────────┐
Hand-written Scalar code,
intrinsics compiler auto-vectorizes
│ │
Called from Need manual control
non-SIMD code? over dispatch?
(after summon(), main)
┌─────┴─────┐ ┌─────┴─────┐
YES NO NO YES
│ │ │ │
┌─────▼──────┐ ┌──▼────────┐ ┌───▼─────────┐ ┌▼────────────────┐
│ #[arcane] │ │ #[rite] │ │#[autoversion]│ │ #[arcane] per │
│ │ │ │ │ │ │ variant + │
│ Entry from │ │ Inlines │ │ Generates │ │ incant!() │
│ normal code│ │ into the │ │ variants + │ │ │
│ One per │ │ caller's │ │ dispatcher │ │ You write each │
│ hot path │ │ LLVM │ │ from one │ │ #[arcane] fn │
│ │ │ region │ │ scalar body │ │ and dispatch │
└─────┬──────┘ └──▲──┬────┘ └─────────────┘ └─────────────────┘
│ │ │
└───────────┘ └──▶ also calls #[rite]
#[arcane] calls and plain #[inline(always)]
#[rite] helpers fns (get caller's features)
Plain #[inline(always)] functions with no macro also work — if they inline into an #[arcane] or #[rite] caller, LLVM compiles them with the caller's features for free.
#[arcane] vs #[rite]: entry point vs internal
#[rite] should be your default. Use #[arcane] only at the entry point — the first call from non-SIMD code.
Two #[rite] syntaxes — same output
```rust
use archmage::prelude::*;
use core::arch::x86_64::__m256;

// Tier-based: specify the tier name, no token parameter needed
// (attribute syntax sketched from the tier names in this README; see the API docs)
#[rite(v3)]
fn add_ps(a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b)
}

// Token-based: reads features from the token type in the signature
#[rite]
fn add_ps_tok(token: X64V3Token, a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b)
}
```
Both produce identical #[target_feature(enable = "avx2,fma,...")] output. The tier-based form is shorter; the token form is handy when you already have the token in scope.
#[arcane] at the entry point
```rust
use archmage::prelude::*;

// (sketch) the entry point — the only place the boundary is crossed
#[arcane]
fn entry(token: X64V3Token, data: &[f32; 8]) -> __m256 {
    _mm256_loadu_ps(data)
}
```
#[arcane] generates an #[inline(always)] wrapper that crosses the #[target_feature] boundary via unsafe — internally, in generated code your crate never sees. From inside, you call #[rite] functions freely (same features = safe call).
Why two macros?
#[arcane] generates a safe wrapper that crosses the #[target_feature] boundary — LLVM can't optimize across it. #[rite] adds #[target_feature] + #[inline] directly, so LLVM inlines it into the caller. Same features = no boundary.
Processing 1000 8-float vector additions (full benchmark details):
| Pattern | Time | Why |
|---|---|---|
| `#[rite]` inside `#[arcane]` | 547 ns | Features match — LLVM inlines |
| `#[arcane]` per iteration | 2209 ns (4x) | Target-feature boundary per call |
| Bare `#[target_feature]` (no archmage) | 2222 ns (4x) | Same boundary — archmage adds nothing |
The 4x penalty is LLVM's #[target_feature] boundary, not archmage overhead. Bare #[target_feature] without archmage has the same cost. With real workloads (DCT-8), the boundary costs up to 6.2x.
The rule: #[arcane] once at the entry point, #[rite] for everything called from SIMD code.
For trait impls, use #[arcane(_self = Type)] — a nested inner-function approach (since sibling would add methods not in the trait definition).
Multi-tier #[rite] (rare)
Occasionally you have scalar code called from inside SIMD functions where #[inline(always)] isn't viable and you don't want #[autoversion]'s dispatch overhead. In that case, give #[rite] multiple tiers and it generates a suffixed variant for each:
```rust
use archmage::prelude::*;

// Generates normalize_v3() and normalize_neon() —
// each compiled with different #[target_feature], auto-vectorized separately
// (attribute syntax sketched; see the API docs)
#[rite(v3, neon)]
fn normalize(data: &mut [f32]) {
    let norm = data.iter().map(|x| x * x).sum::<f32>().sqrt();
    for x in data.iter_mut() {
        *x /= norm;
    }
}
```
Call the matching suffixed variant from an #[arcane] or #[rite] context:
```rust
use archmage::prelude::*;

// (sketch) call the suffixed variant matching the surrounding tier
#[arcane]
fn entry(token: X64V3Token, data: &mut [f32]) {
    normalize_v3(data);
}
```
These are #[target_feature] functions — you can't call them from non-SIMD code or dispatch them via incant!. They're for internal use within a SIMD call chain.
Auto-vectorization with #[autoversion]
Don't want to write intrinsics? Write plain scalar code and let the compiler vectorize it:
```rust
use archmage::prelude::*;

#[autoversion]
fn sum_of_squares(_token: SimdToken, data: &[f32]) -> f32 {
    data.iter().map(|x| x * x).sum()
}

// Call directly — no token needed, no unsafe:
let result = sum_of_squares(&data);
```
#[autoversion] generates a separate copy of your function for each architecture tier — each compiled with #[target_feature] to unlock the auto-vectorizer — plus a runtime dispatcher that picks the best one. On x86-64 with AVX2+FMA, that loop compiles to vfmadd231ps (8 floats per cycle). On ARM, you get fmla. The _scalar fallback compiles without SIMD features as a safety net.
The _token: SimdToken parameter is a placeholder — you don't use it in the body. The macro replaces it with concrete token types (X64V3Token, NeonToken, etc.) for each variant.
What gets generated (default tiers):
- `sum_of_squares_v4(token: X64V4Token, ...)` — AVX-512 (with `avx512` feature)
- `sum_of_squares_v3(token: X64V3Token, ...)` — AVX2+FMA
- `sum_of_squares_neon(token: NeonToken, ...)` — AArch64 NEON
- `sum_of_squares_wasm128(token: Wasm128Token, ...)` — WASM SIMD
- `sum_of_squares_scalar(token: ScalarToken, ...)` — no SIMD
- `sum_of_squares(data: &[f32]) -> f32` — dispatcher (token param removed)
Explicit tiers: #[autoversion(v3, neon)]. scalar is always implicit.
For inherent methods, self works naturally — no special parameters needed. For trait method delegation, use #[autoversion(_self = MyType)] and _self in the body. See the full parameter reference or the API docs.
When to use which:
| | `#[autoversion]` | `#[arcane]` + `#[rite]` |
|---|---|---|
| You write | Scalar loops | SIMD intrinsics |
| Vectorization | Compiler auto-vectorizes | You choose the instructions |
| Lines of code | 1 attribute | Manual variant + dispatch |
| Best for | Simple numeric loops | Hand-tuned SIMD kernels |
SIMD types with magetypes
magetypes provides ergonomic SIMD vector types (f32x8, i32x4, etc.) with natural Rust operators. It's an exploratory companion crate — the API may change between releases.
```toml
[dependencies]
archmage = "0.9"
magetypes = "0.9"
```
```rust
use archmage::prelude::*;
use magetypes::f32x8;

// (sketch) f32x8 supports natural Rust operators
#[arcane]
fn madd(token: X64V3Token, a: f32x8, b: f32x8, c: f32x8) -> f32x8 {
    a * b + c
}
```
f32x8 wraps __m256 on x86 with AVX2. On ARM/WASM, it's polyfilled with two f32x4 operations — same API, automatic fallback. The #[arcane] wrapper lets LLVM optimize the entire loop as a single SIMD region.
Tier naming conventions
incant! and #[autoversion] dispatch to suffixed functions. incant!(sum(data)) calls sum_v3, sum_neon, etc. These suffixes correspond to tokens:
| Suffix | Token | Arch | Key features |
|---|---|---|---|
| `_v1` | `X64V1Token` | x86_64 | SSE2 (baseline) |
| `_v2` | `X64V2Token` | x86_64 | + SSE4.2, POPCNT |
| `_v3` | `X64V3Token` | x86_64 | + AVX2, FMA, BMI2 |
| `_v4` | `X64V4Token` | x86_64 | + AVX-512 |
| `_neon` | `NeonToken` | aarch64 | NEON |
| `_arm_v2` | `Arm64V2Token` | aarch64 | + CRC, RDM, DotProd, AES, SHA2 |
| `_arm_v3` | `Arm64V3Token` | aarch64 | + SHA3, I8MM, BF16 |
| `_wasm128` | `Wasm128Token` | wasm32 | SIMD128 |
| `_scalar` | `ScalarToken` | any | No SIMD (always available) |
By default, incant! tries _v4 (if the avx512 feature is enabled), _v3, _neon, _wasm128, then _scalar. You can restrict to specific tiers: incant!(sum(data), [v3, neon, scalar]). Always include scalar in explicit tier lists — it documents the fallback path. (Currently auto-appended if omitted; will become a compile error in v1.0.)
Runtime dispatch with incant! (alias: dispatch_variant!)
Write platform-specific variants with concrete types, then dispatch at runtime:
```rust
use archmage::prelude::*; // incant!, tokens
use magetypes::f32x8;

// (sketch) one variant per tier; first parameter is the matching token type
#[arcane]
fn sum_v3(token: X64V3Token, data: &[f32]) -> f32 {
    data.iter().sum() // AVX2 path (would use f32x8 here)
}

#[arcane]
fn sum_neon(token: NeonToken, data: &[f32]) -> f32 {
    data.iter().sum() // NEON path
}

fn sum_scalar(_token: ScalarToken, data: &[f32]) -> f32 {
    data.iter().sum()
}

/// Dispatches to the best available at runtime.
fn sum(data: &[f32]) -> f32 {
    incant!(sum(data))
}
```
Each variant's first parameter is the matching token type — _v3 takes X64V3Token, _neon takes NeonToken, etc.
_scalar is mandatory. incant! always emits an unconditional call to fn_scalar(ScalarToken, ...) as the final fallback. If the _scalar function doesn't exist, you get a compile error — not a runtime failure. Always include scalar in explicit tier lists (e.g., [v3, neon, scalar]) to document this dependency. Currently scalar is auto-appended if omitted; this will become a compile error in v1.0.
incant! wraps each tier's call in #[cfg(target_arch)] and #[cfg(feature)] guards, so you only define variants for architectures you target. With no explicit tier list, incant! dispatches to v3, neon, wasm128, and scalar by default (plus v4 if the avx512 feature is enabled).
Known tiers: v1, v2, x64_crypto, v3, v3_crypto, v4, v4x, arm_v2, arm_v3, neon, neon_aes, neon_sha3, neon_crc, wasm128, scalar.
If you already have a token, use with to dispatch on its concrete type: incant!(func(data) with token).
Tokens
| Token | Alias | Features | Hardware |
|---|---|---|---|
| `X64V1Token` | `Sse2Token` | SSE, SSE2 | x86_64 baseline (always available) |
| `X64V2Token` | — | + SSE4.2, POPCNT | Nehalem 2008+ |
| `X64CryptoToken` | — | V2 + PCLMULQDQ, AES-NI | Westmere 2010+ |
| `X64V3Token` | — | + AVX2, FMA, BMI2 | Haswell 2013+, Zen 1+ |
| `X64V3CryptoToken` | — | V3 + VPCLMULQDQ, VAES | Zen 3+ 2020, Alder Lake 2021+ |
| `X64V4Token` | `Server64` | + AVX-512 (requires `avx512` feature) | Skylake-X 2017+, Zen 4+ |
| `NeonToken` | `Arm64` | NEON | All 64-bit ARM |
| `Arm64V2Token` | — | + CRC, RDM, DotProd, FP16, AES, SHA2 | A55+, M1+, Graviton 2+ |
| `Arm64V3Token` | — | + FHM, FCMA, SHA3, I8MM, BF16 | A510+, M2+, Snapdragon X |
| `Wasm128Token` | — | WASM SIMD | Compile-time only |
| `ScalarToken` | — | (none) | Always available |
Higher tokens subsume lower ones: X64V4Token → X64V3Token → X64V2Token → X64V1Token. Downcasting is zero-cost. #[arcane(stub)] generates unreachable stubs on non-matching architectures when you need cross-arch dispatch without #[cfg] guards; incant! handles cfg-gating automatically.
See token-registry.toml for the complete mapping of tokens to CPU features.
Testing SIMD dispatch paths
for_each_token_permutation tests every incant! dispatch path on your native hardware — no cross-compilation needed. It disables tokens one at a time, running your closure at each combination from "all SIMD enabled" down to "scalar only":
```rust
use archmage::for_each_token_permutation; // (import path sketched; see the testing docs)
```
On an AVX-512 machine this runs 5–7 permutations; on Haswell, 3. Tokens the CPU doesn't have are skipped automatically.
The testable_dispatch feature
When you compile with -Ctarget-cpu=native (or -Ctarget-cpu=haswell, etc.), Rust bakes the feature checks into the binary at compile time. X64V3Token::summon() compiles to a constant Some(token) — it can't be disabled at runtime, and for_each_token_permutation can't test fallback paths.
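For a whole project, the compile-time route is plain Cargo configuration — standard Cargo/rustc flags, nothing archmage-specific:

```toml
# .cargo/config.toml — bake Haswell-era features in at build time;
# token detection then compiles down to constants.
[build]
rustflags = ["-C", "target-cpu=haswell"]
```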
The testable_dispatch feature forces runtime detection even when compile-time detection would succeed. Enable it in dev-dependencies for full permutation coverage:
```toml
[dev-dependencies]
archmage = { version = "0.9", features = ["testable_dispatch"] }
```
CompileTimePolicy controls what happens when a token can't be disabled:
| Policy | Behavior |
|---|---|
| `Warn` | Skip the token, collect a warning in the report |
| `WarnStderr` | Same, plus print each warning to stderr |
| `Fail` | Panic — use in CI with `testable_dispatch` where full coverage is expected |
Serializing parallel tests with lock_token_testing()
Token disabling is process-wide — if cargo test runs tests in parallel, one test disabling X64V3Token would break another test that expects it to be available. for_each_token_permutation acquires an internal mutex automatically. If you need to disable tokens manually (via dangerously_disable_token_process_wide), wrap your test in lock_token_testing() to serialize against other permutation tests:
```rust
use archmage::lock_token_testing; // (import paths sketched; see the testing docs)
use archmage::for_each_token_permutation;
```
The lock is reentrant — for_each_token_permutation called from within a lock_token_testing scope works without deadlock.
For the full testing API, see the testing docs.
Feature flags
| Feature | Default | Description |
|---|---|---|
| `std` | yes | Standard library (required for runtime detection) |
| `macros` | yes | No-op (macros are always available); kept for backwards compatibility |
| `avx512` | no | AVX-512 tokens (`X64V4Token`, `X64V4xToken`, `Avx512Fp16Token`) |
| `testable_dispatch` | no | Makes token disabling work with `-Ctarget-cpu=native` |
Acknowledgments
- `safe_unaligned_simd` by okaneco — Reference-based wrappers for every SIMD load/store intrinsic across x86, ARM, and WASM. This crate closed the last `unsafe` gap: `_mm256_loadu_ps` taking `*const f32` was the one thing you couldn't make safe without a wrapper. Archmage depends on it and re-exports its functions through `import_intrinsics`, shadowing `core::arch`'s pointer-based versions automatically.
License
MIT OR Apache-2.0