archmage
Browse 12,000+ SIMD Intrinsics → · Docs · Magetypes · API Docs
Safely invoke your intrinsic power, using the tokens granted to you by the CPU.
Zero overhead. Archmage generates identical assembly to hand-written unsafe code. The safety abstractions exist only at compile time—at runtime, you get raw SIMD instructions. Calling an #[arcane] function costs exactly the same as calling a bare #[target_feature] function directly.
Zero unsafe. Crates using archmage + magetypes + safe_unaligned_simd are required to use* #![forbid(unsafe_code)]. There is no reason to write unsafe in SIMD code anymore.
```toml
[dependencies]
archmage = "0.8"
magetypes = "0.8"
```
Raw intrinsics with #[arcane] (alias: #[token_target_features_boundary])
```rust
use archmage::prelude::*;
```
summon() checks CPUID. #[arcane] enables #[target_feature], making intrinsics safe (Rust 1.85+). The prelude re-exports safe_unaligned_simd functions directly — _mm256_loadu_ps takes &[f32; 8], not a raw pointer. Compile with -C target-cpu=haswell to elide the runtime check.
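For orientation, here is a minimal sketch of that flow. The `Desktop64::summon()` call form and the `_mm256_storeu_ps(&mut out, …)` signature are assumptions based on the description above, not verbatim crate code:

```rust
use archmage::prelude::*;

// Entry point: #[arcane] reads `Desktop64` from the signature and emits the
// matching #[target_feature], so the intrinsics below are safe to call.
#[arcane]
fn add8(_token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let va = _mm256_loadu_ps(a);      // reference-based load from the prelude
    let vb = _mm256_loadu_ps(b);
    let sum = _mm256_add_ps(va, vb);  // value-based intrinsic: safe under #[target_feature]
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(&mut out, sum);  // reference-based store (signature assumed)
    out
}

fn main() {
    // Runtime CPUID check; returns None if AVX2/FMA are missing.
    if let Some(token) = Desktop64::summon() {
        let _ = add8(token, &[1.0; 8], &[2.0; 8]);
    }
}
```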
Inner helpers with #[rite] (alias: #[token_target_features])
#[rite] should be your default. Use #[arcane] only at entry points.
```rust
use archmage::prelude::*;

// Entry point: use #[arcane]
// Inner helper: use #[rite] (inlines into #[arcane] — features match)
```
Both macros read the token type from your function signature to decide which #[target_feature] to emit. Desktop64 → avx2,fma,.... X64V4Token → avx512f,avx512bw,.... The token type is the feature selector.
#[arcane] generates a wrapper: an outer function that calls an inner #[target_feature] function via unsafe. This is how you cross into SIMD code without writing unsafe yourself — but the wrapper creates an LLVM optimization boundary. #[rite] applies #[target_feature] + #[inline] directly, with no wrapper and no boundary. Since Rust 1.85+, calling #[target_feature] functions from matching contexts is safe — no unsafe needed between #[arcane] and #[rite] functions.
#[rite] should be your default. Use #[arcane] only at the entry point (the first call from non-SIMD code), and #[rite] for everything inside. Passing the same token type through your call hierarchy keeps every function compiled with matching features, so LLVM inlines freely.
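A hedged sketch of that split, with one #[arcane] entry point and one #[rite] helper (the function names and the exact summon/store signatures are illustrative, not taken from the crate docs):

```rust
use archmage::prelude::*;

// Inner helper: #[rite] applies #[target_feature] + #[inline] directly.
// Called from a matching #[arcane] context, so no boundary and no unsafe.
#[rite]
fn mul_add(_token: Desktop64, acc: __m256, a: __m256, b: __m256) -> __m256 {
    _mm256_fmadd_ps(a, b, acc)
}

// Entry point: the only #[arcane] function. The token flows into the helper,
// so both bodies are compiled with the same features and LLVM inlines freely.
#[arcane]
fn dot8(token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let acc = mul_add(token, _mm256_setzero_ps(), _mm256_loadu_ps(a), _mm256_loadu_ps(b));
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(&mut out, acc);  // reference-based store (signature assumed)
    out.iter().sum()
}
```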
The cost of mismatched features
Processing 1000 8-float vector additions (full benchmark details):
| Pattern | Time | Why |
|---|---|---|
| #[rite] in #[arcane] | 547 ns | Features match — LLVM inlines |
| #[arcane] per iteration | 2209 ns (4x) | Target-feature boundary per call |
| Bare #[target_feature] (no archmage) | 2222 ns (4x) | Same boundary — archmage adds nothing |
The 4x penalty comes from LLVM's #[target_feature] optimization boundary, not from archmage. Bare #[target_feature] has the same cost. With real workloads (DCT-8), the boundary costs up to 6.2x.
Use #[rite] for helpers called from SIMD code. When the token type matches, #[rite] emits the same #[target_feature] as the caller, so LLVM inlines freely — no boundary. The token flows through your call tree, keeping features consistent everywhere it goes.
SIMD types with magetypes
```rust
use archmage::prelude::*;
use magetypes::f32x8;
```
f32x8 wraps __m256 with token-gated construction and natural operators.
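A hedged sketch of what that looks like in practice. The constructor and accessor names here (`f32x8::load`, `f32x8::splat`, `to_array`) are hypothetical, used only to illustrate token-gated construction and the operator overloads described above:

```rust
use archmage::prelude::*;
use magetypes::f32x8;

// Hypothetical constructor/accessor names; the point is that construction
// requires a token and that arithmetic uses ordinary operators.
#[rite]
fn scale_and_add(token: Desktop64, x: &[f32; 8], y: &[f32; 8], s: f32) -> [f32; 8] {
    let vx = f32x8::load(token, x);                 // token-gated construction
    let vy = f32x8::load(token, y);
    let scaled = vx * f32x8::splat(token, s) + vy;  // natural operators on the wrapped __m256
    scaled.to_array()
}
```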
Runtime dispatch with incant! (alias: dispatch_variant!)
Write platform-specific variants with concrete types, then dispatch at runtime:
```rust
use archmage::incant;
use magetypes::f32x8;

const LANES: usize = 8;

/// AVX2 path — processes 8 floats at a time.
fn sum_squares_v3(token: X64V3Token, data: &[f32]) -> f32 { /* ... */ }

/// Scalar fallback — always required.
fn sum_squares_scalar(_token: ScalarToken, data: &[f32]) -> f32 { /* ... */ }

/// Public API — dispatches to the best available at runtime.
fn sum_squares(data: &[f32]) -> f32 { /* incant! dispatch over the variants above */ }
```
Each variant's first parameter is the matching token type — _v3 takes X64V3Token, _neon takes NeonToken, etc. A _scalar variant (taking ScalarToken) is always required. incant! calls the best variant the CPU supports, falling back to _scalar.
What you need to provide
incant! wraps each tier's call in #[cfg(target_arch)] and #[cfg(feature)] guards, so you only need to define variants for architectures you target. The example above uses [v3], so it only needs _v3 (x86-64) and _scalar.
With no explicit tier list, incant! dispatches to v3, neon, wasm128, and scalar by default (plus v4 if the avx512 feature is enabled):
// Requires on x86-64: sum_squares_v3, sum_squares_scalar
// (+ sum_squares_v4 if `avx512` feature is enabled)
// Requires on aarch64: sum_squares_neon, sum_squares_scalar
// Requires on wasm32: sum_squares_wasm128, sum_squares_scalar
Each architecture only sees its own tier references at compile time. A crate that builds for all three platforms needs all four variants (v3, neon, wasm128, scalar); a crate that only targets x86-64 needs just v3 and scalar.
Explicit tiers
Specify exactly which tiers to try:
// Requires: sum_squares_v1, sum_squares_v3, sum_squares_neon, sum_squares_scalar
Scalar is always appended implicitly. Known tiers: v1, v2, x64_crypto, v3, v3_crypto, v4, v4x, arm_v2, arm_v3, neon, neon_aes, neon_sha3, neon_crc, wasm128, scalar.
Passthrough mode
If you already have a token (e.g., inside a generic function), use `with` to dispatch on its concrete type instead of summoning a new one.
ARM compute tiers and f16
The default tiers skip ARM compute tiers, but arm_v2 adds useful features for half-precision and fixed-point workloads. Arm64V2Token covers M1+, Graviton 2+, and all post-2017 ARM chips, adding FP16, rounding doubling multiply (RDM), CRC, AES, and SHA2.
For f16 specifically: X64V3Token includes F16C (hardware f32↔f16 conversion, 4 stable intrinsics). Arm64V2Token includes FP16 with 95 stable intrinsics (conversion, division, FMA) and 115 more on nightly. Use explicit tiers to dispatch to both:
// x86-64: f32_to_f16_v3(X64V3Token, ...) — F16C hardware
// aarch64: f32_to_f16_arm_v2(Arm64V2Token, ...) — NEON FP16 hardware
// all: f32_to_f16_scalar(ScalarToken, ...) — bit manipulation fallback
The scalar fallback covers WASM and any platform without hardware f16.
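For reference, a scalar bit-manipulation fallback can be as small as the sketch below. It is not taken from the crate: it truncates the mantissa instead of rounding to nearest-even and flushes subnormals to zero, which is usually acceptable for a last-resort path:

```rust
/// Scalar f32 -> f16 bit conversion (illustrative; truncating, subnormals flushed).
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;      // sign bit moved to f16 position
    let exp = ((bits >> 23) & 0xff) as i32;          // biased f32 exponent
    let mantissa = bits & 0x007f_ffff;

    if exp == 0xff {
        // Inf/NaN: keep a quiet-NaN payload bit if any mantissa bit was set.
        return sign | 0x7c00 | if mantissa != 0 { 0x0200 } else { 0 };
    }
    let half_exp = exp - 127 + 15;                   // rebias: f32 bias 127 -> f16 bias 15
    if half_exp >= 0x1f {
        return sign | 0x7c00;                        // overflow -> infinity
    }
    if half_exp <= 0 {
        return sign;                                 // underflow -> signed zero
    }
    sign | ((half_exp as u16) << 10) | ((mantissa >> 13) as u16)
}
```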
#[magetypes] for simple cases
If your function body doesn't use SIMD types (only Token), #[magetypes] can generate the variants for you by replacing Token with the concrete token type for each platform:
```rust
use archmage::magetypes;
```
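A hedged sketch of the shape this takes. The generic `Token` parameter name follows the description above; the exact attribute syntax and the generated variant names are assumptions:

```rust
use archmage::prelude::*;

// #[magetypes] generates one variant per tier by substituting `Token` with the
// concrete token type (X64V3Token, NeonToken, ScalarToken, ...), ready for
// incant! dispatch. Exact expansion details are assumptions here.
#[magetypes]
fn sum_squares(_token: Token, data: &[f32]) -> f32 {
    data.iter().map(|x| x * x).sum()
}
```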
Specify explicit tiers to control which variants are generated.
For functions that use platform-specific SIMD types (f32x8, f32x4, etc.), write the variants manually and use incant! as shown above.
Tokens
| Token | Alias | Features |
|---|---|---|
| X64V1Token | Sse2Token | SSE, SSE2 (x86_64 baseline — always available) |
| X64V2Token | | SSE4.2, POPCNT |
| X64CryptoToken | | V2 + PCLMULQDQ, AES-NI (Westmere 2010+) |
| X64V3Token | Desktop64 | AVX2, FMA, BMI2 |
| X64V3CryptoToken | | V3 + VPCLMULQDQ, VAES (Zen 3+ 2020, Alder Lake 2021+) |
| X64V4Token | Server64 | AVX-512 (requires avx512 feature) |
| NeonToken | Arm64 | NEON |
| Arm64V2Token | | + CRC, RDM, DotProd, FP16, AES, SHA2 (A55+, M1+) |
| Arm64V3Token | | + FHM, FCMA, SHA3, I8MM, BF16 (A510+, M2+, Snapdragon X) |
| Wasm128Token | | WASM SIMD |
| ScalarToken | | Always available |
All tokens compile on all platforms. summon() returns None on unsupported architectures. Detection is cached: ~1.3 ns after first call, 0 ns with -Ctarget-cpu=haswell (compiles away).
See token-registry.toml for the complete mapping of tokens to CPU features.
Safety model
Archmage's safety rests on three pillars, all enabled by Rust 1.85+:
- Value-based SIMD intrinsics are safe inside #[target_feature] functions. Arithmetic, shuffle, compare, and bitwise operations need no unsafe. Only pointer-based memory operations remain unsafe.
- Calling a #[target_feature] function from another function with matching features is safe. No unsafe needed between #[arcane] and #[rite] functions — LLVM knows the features match.
- safe_unaligned_simd makes memory operations safe. It shadows pointer-based load/store intrinsics with reference-based alternatives (e.g., _mm256_loadu_ps takes &[f32; 8] instead of *const f32).
Together, these mean your crate should use #![forbid(unsafe_code)]. The unsafe lives inside archmage's generated wrappers, not in your code. If you find yourself writing unsafe in a crate that uses archmage, something has gone wrong.
The prelude
use archmage::prelude::* gives you:
- Tokens: Desktop64, Arm64, Arm64V2Token, Arm64V3Token, ScalarToken, etc.
- Traits: SimdToken, IntoConcreteToken, HasX64V2, etc.
- Macros: #[arcane], #[rite], #[magetypes], incant!
- Intrinsics: core::arch::* for your platform
- Memory ops: safe_unaligned_simd functions (reference-based, no raw pointers)
Testing SIMD dispatch paths
Every incant! dispatch and if let Some(token) = summon() branch creates a fallback path. You can test all of them on your native hardware — no cross-compilation needed.
Exhaustive permutation testing
for_each_token_permutation runs your closure once for every unique combination of token tiers, from "all SIMD enabled" down to "scalar only". It handles the disable/re-enable lifecycle, mutex serialization, cascade logic, and deduplication.
```rust
use archmage::for_each_token_permutation;
```
On an AVX-512 machine, this runs 5–7 permutations (all enabled → AVX-512 only → AVX2+FMA → SSE4.2 → scalar). On a Haswell-era CPU without AVX-512, 3 permutations. Tokens the CPU doesn't have are skipped — they'd produce duplicate states.
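A hedged sketch of such a test, reusing the sum_squares example from above. The import path, the summon call form, the closure signature, and the return value of for_each_token_permutation are assumptions here; only the permutation behavior is taken from the description above:

```rust
use archmage::for_each_token_permutation; // import path assumed
use archmage::prelude::*;

#[test]
fn sum_squares_is_consistent_across_tiers() {
    let data: Vec<f32> = (0..1024).map(|i| i as f32 * 0.5).collect();
    // ScalarToken is always available, so the fallback gives the reference result.
    let expected = sum_squares_scalar(ScalarToken::summon().unwrap(), &data);

    // The closure re-runs the dispatched function under each permutation of
    // enabled tokens; every tier must agree with the scalar result.
    let _ = for_each_token_permutation(|| {
        assert!((sum_squares(&data) - expected).abs() < 1e-3);
    });
}
```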
Token disabling is process-wide, so run tests with `cargo test -- --test-threads=1`.
CompileTimePolicy and -Ctarget-cpu
If you compiled with -Ctarget-cpu=native, the compiler bakes feature detection into the binary. summon() returns Some unconditionally, and tokens can't be disabled at runtime — the runtime check was compiled out.
The CompileTimePolicy enum controls what happens when for_each_token_permutation encounters these undisableable tokens:
- Warn — Exclude the token from permutations silently. Warnings are collected in the report.
- WarnStderr — Same, but also prints each warning to stderr with actionable fix instructions.
- Fail — Panic with the exact compiler flags needed to fix it.
For full coverage in CI, use the testable_dispatch feature. This makes compiled_with() return None even when features are baked in, so summon() uses runtime detection and tokens can be disabled:
```toml
# In your CI test configuration
[dependencies]
archmage = { version = "0.8", features = ["testable_dispatch"] }
```
Enforcing full coverage via env var
Wire an environment variable to switch between Warn in local development and Fail in CI:
```rust
use archmage::CompileTimePolicy;
```
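A hedged sketch of that wiring. The import path is assumed, and how the chosen policy is handed to for_each_token_permutation is not shown here; only the Warn/Fail variants and the environment variable name come from this README:

```rust
use archmage::CompileTimePolicy; // import path assumed

// Pick the policy from the environment: strict in CI, lenient locally.
// (Variable name matches the CI invocation below.)
fn permutation_policy() -> CompileTimePolicy {
    if std::env::var_os("ARCHMAGE_FULL_PERMUTATIONS").is_some() {
        CompileTimePolicy::Fail
    } else {
        CompileTimePolicy::Warn
    }
}
```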
Then in CI (with testable_dispatch enabled):
```sh
ARCHMAGE_FULL_PERMUTATIONS=1 cargo test -- --test-threads=1
```
If a token is still compile-time guaranteed (you forgot the feature or have stale RUSTFLAGS), Fail panics with the exact flags to fix it:
```text
x86-64-v3: compile-time guaranteed, excluded from permutations. To include it, either:
1. Add `testable_dispatch` to archmage features in Cargo.toml
2. Remove `-Ctarget-cpu` from RUSTFLAGS
3. Compile with RUSTFLAGS="-Ctarget-feature=-avx2,-fma,-bmi1,-bmi2,-f16c,-lzcnt"
```
Manual single-token disable
For targeted tests that only need to disable one token, you can disable just that token rather than all of them.
Disabling cascades downward: disabling V2 also disables V3/V4/Modern/Fp16; disabling NEON also disables Aes/Sha3/Crc.
Disabling all SIMD at once
dangerously_disable_tokens_except_wasm(true) disables all SIMD tokens in one call:
```rust
use archmage::dangerously_disable_tokens_except_wasm;

// Force scalar-only execution for benchmarking
dangerously_disable_tokens_except_wasm(true).unwrap();
let scalar_result = my_simd_function(/* ... */);
dangerously_disable_tokens_except_wasm(false).unwrap(); // restore SIMD tokens (assumed: `false` re-enables)
```
This disables V2 on x86 (cascading to V3/V4/Modern/Fp16) and NEON on ARM (cascading to Aes/Sha3/Crc). V1 (Sse2Token) is not disabled — SSE2 is the x86_64 baseline and can't be meaningfully turned off at runtime. WASM is excluded because simd128 is always a compile-time decision.
Feature flags
| Feature | Default | Description |
|---|---|---|
| std | yes | Standard library |
| macros | yes | #[arcane], #[magetypes], incant! |
| safe_unaligned_simd | yes | Re-exports via prelude |
| avx512 | no | AVX-512 tokens |
| testable_dispatch | no | Makes token disabling work with -Ctarget-cpu=native |
License
MIT OR Apache-2.0
* OK, #![forbid(unsafe_code)] isn't technically enforced by archmage. But with #[arcane]/#[rite] handling #[target_feature], safe_unaligned_simd handling memory ops, and Rust 1.85+ making value intrinsics safe — there's genuinely nothing left that needs unsafe in your SIMD code. If your crate uses archmage and still has unsafe blocks, that's a code smell, not a necessity.