# archmage
Safely invoke your intrinsic power, using the tokens granted to you by the CPU.
**Zero overhead.** Archmage generates identical assembly to hand-written unsafe code. The safety abstractions exist only at compile time — at runtime, you get raw SIMD instructions. Calling an `#[arcane]` function costs exactly the same as calling a bare `#[target_feature]` function directly.
```toml
[dependencies]
archmage = "0.8"
magetypes = "0.8"
```
## Raw intrinsics with `#[arcane]`
```rust
use archmage::prelude::*;
```
`summon()` checks CPUID. `#[arcane]` enables `#[target_feature]`, making intrinsics safe (Rust 1.85+). The prelude re-exports `safe_unaligned_simd` functions directly — `_mm256_loadu_ps` takes `&[f32; 8]`, not a raw pointer. Compile with `-C target-cpu=haswell` to elide the runtime check.
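A minimal sketch of the flow (the `Desktop64::summon()` spelling and the `add8` function are assumptions based on the description above, not the crate's documented API):

```rust
use archmage::prelude::*;

// Token type selects the features: Desktop64 → avx2,fma,...
#[arcane]
fn add8(_token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    // Safe: the prelude's _mm256_loadu_ps takes &[f32; 8], not a raw pointer.
    let va = _mm256_loadu_ps(a);
    let vb = _mm256_loadu_ps(b);
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(&mut out, _mm256_add_ps(va, vb));
    out
}

fn main() {
    // summon() checks CPUID; None on CPUs without AVX2/FMA.
    if let Some(token) = Desktop64::summon() {
        let sum = add8(token, &[1.0; 8], &[2.0; 8]);
        assert_eq!(sum, [3.0; 8]);
    }
}
```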
## Inner helpers with `#[rite]`

`#[rite]` should be your default. Use `#[arcane]` only at entry points.
```rust
use archmage::prelude::*;

// Entry point: use #[arcane]
// ...elided...

// Inner helper: use #[rite] (inlines into #[arcane] — features match)
// ...elided...
```
Both macros read the token type from your function signature to decide which `#[target_feature]` to emit. `Desktop64` → `avx2,fma,...`. `X64V4Token` → `avx512f,avx512bw,...`. The token type is the feature selector.

`#[arcane]` generates a wrapper: an outer function that calls an inner `#[target_feature]` function via `unsafe`. This is how you cross into SIMD code without writing `unsafe` yourself — but the wrapper creates an LLVM optimization boundary. `#[rite]` applies `#[target_feature]` + `#[inline]` directly, with no wrapper and no boundary. Since Rust 1.85+, calling `#[target_feature]` functions from matching contexts is safe — no `unsafe` needed between `#[arcane]` and `#[rite]` functions.

`#[rite]` should be your default. Use `#[arcane]` only at the entry point (the first call from non-SIMD code), and `#[rite]` for everything inside. Passing the same token type through your call hierarchy keeps every function compiled with matching features, so LLVM inlines freely.
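A sketch of that split (the function names and bodies are invented for illustration; `Desktop64` implies AVX2+FMA per the token table):

```rust
use archmage::prelude::*;

// Inner helper: #[rite] applies #[target_feature] + #[inline] directly,
// so it inlines into its #[arcane] caller with no boundary.
#[rite]
fn mul_add(_token: Desktop64, acc: __m256, x: __m256, y: __m256) -> __m256 {
    _mm256_fmadd_ps(x, y, acc)
}

// Entry point: #[arcane] wraps the unsafe feature transition exactly once.
#[arcane]
fn dot8(token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let acc = mul_add(token, _mm256_setzero_ps(), _mm256_loadu_ps(a), _mm256_loadu_ps(b));
    // Horizontal sum kept simple: store the lanes and add them.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(&mut lanes, acc);
    lanes.iter().sum()
}
```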
## The cost of mismatched features

Processing 1000 8-float vector additions (full benchmark details):

| Pattern | Time | Why |
|---|---|---|
| `#[rite]` in `#[arcane]` | 547 ns | Features match — LLVM inlines |
| `#[arcane]` per iteration | 2209 ns (4x) | Target-feature boundary per call |
| Bare `#[target_feature]` (no archmage) | 2222 ns (4x) | Same boundary — archmage adds nothing |
The 4x penalty comes from LLVM's `#[target_feature]` optimization boundary, not from archmage. Bare `#[target_feature]` has the same cost. With real workloads (DCT-8), the boundary costs up to 6.2x.

Use `#[rite]` for helpers called from SIMD code. When the token type matches, `#[rite]` emits the same `#[target_feature]` as the caller, so LLVM inlines freely — no boundary. The token flows through your call tree, keeping features consistent everywhere it goes.
## SIMD types with `magetypes`
```rust
use archmage::prelude::*;
use magetypes::f32x8;
```
`f32x8` wraps `__m256` with token-gated construction and natural operators.
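A hedged sketch of what that enables — the operator overloads are the documented part, while `load`, `splat`, and `to_array` are hypothetical method names standing in for whatever the real constructors are called:

```rust
use archmage::prelude::*;
use magetypes::f32x8;

#[arcane]
fn lerp8(token: Desktop64, a: &[f32; 8], b: &[f32; 8], t: f32) -> [f32; 8] {
    // Token-gated construction: building an f32x8 requires a summoned token.
    let va = f32x8::load(token, a);   // hypothetical constructor name
    let vb = f32x8::load(token, b);   // hypothetical constructor name
    let vt = f32x8::splat(token, t);  // hypothetical constructor name
    // Natural operators instead of raw _mm256_* calls.
    (va + (vb - va) * vt).to_array()  // hypothetical accessor name
}
```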
## Runtime dispatch with `incant!`

Write platform-specific variants with concrete types, then dispatch at runtime:
```rust
use archmage::incant;
use magetypes::f32x8;

const LANES: usize = 8;

/// AVX2 path — processes 8 floats at a time.
// ...`_v3` variant elided...

/// Scalar fallback.
// ...`_scalar` variant elided...

/// Public API — dispatches to the best available at runtime.
// ...`incant!` dispatcher elided...
```
`incant!` looks for `_v3`, `_v4`, `_neon`, `_wasm128`, and `_scalar` suffixed functions by default, and dispatches to the best one the CPU supports. Each variant uses concrete SIMD types for its platform; the scalar fallback uses plain math.
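Filled in, the variants might look like this (the function signatures, the `f32x8` methods shown, and the exact `incant!` invocation syntax are assumptions, not documented API):

```rust
use archmage::{incant, prelude::*};
use magetypes::f32x8;

const LANES: usize = 8;

/// AVX2 path — processes 8 floats at a time.
#[rite]
fn sum_v3(token: X64V3Token, chunk: &[f32; LANES]) -> f32 {
    f32x8::load(token, chunk).reduce_add() // hypothetical f32x8 methods
}

/// Scalar fallback.
fn sum_scalar(_token: ScalarToken, chunk: &[f32; LANES]) -> f32 {
    chunk.iter().sum()
}

/// Public API — dispatches to the best available at runtime.
pub fn sum(chunk: &[f32; LANES]) -> f32 {
    incant!(sum(chunk)) // exact macro syntax is an assumption
}
```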
You can specify explicit tiers to control which variants are dispatched to, e.g. only v1, v3, neon, and scalar.

Known tiers: `v1`, `v2`, `v3`, `v4`, `v4x`, `arm_v2`, `arm_v3`, `neon`, `neon_aes`, `neon_sha3`, `neon_crc`, `wasm128`, `scalar`. The `scalar` tier is always included implicitly.
## `#[magetypes]` for simple cases
If your function body doesn't use SIMD types (only `Token`), `#[magetypes]` can generate the variants for you by replacing `Token` with the concrete token type for each platform:

```rust
use archmage::magetypes;
```
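A sketch of the attribute in use (the function itself is invented; the `Token` placeholder is the documented part):

```rust
use archmage::{magetypes, prelude::*};

// #[magetypes] clones this function per tier, replacing `Token` with the
// concrete token type (X64V3Token, NeonToken, ...) for each platform.
#[magetypes]
fn count_ones(_token: Token, data: &[u64]) -> u32 {
    // Body uses no platform-specific SIMD types, so one definition
    // covers every tier; the compiler vectorizes per target_feature.
    data.iter().map(|x| x.count_ones()).sum()
}
```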
Specify explicit tiers to control which variants are generated, using the same tier names as `incant!`.

For functions that use platform-specific SIMD types (`f32x8`, `f32x4`, etc.), write the variants manually and use `incant!` as shown above.
## Tokens

| Token | Alias | Features |
|---|---|---|
| `X64V1Token` | `Sse2Token` | SSE, SSE2 (x86_64 baseline — always available) |
| `X64V2Token` | | SSE4.2, POPCNT |
| `X64V3Token` | `Desktop64` | AVX2, FMA, BMI2 |
| `X64V4Token` | `Server64` | AVX-512 (requires `avx512` feature) |
| `NeonToken` | `Arm64` | NEON |
| `Arm64V2Token` | | + CRC, RDM, DotProd, FP16, AES, SHA2 (A55+, M1+) |
| `Arm64V3Token` | | + FHM, FCMA, SHA3, I8MM, BF16 (A510+, M2+, Snapdragon X) |
| `Wasm128Token` | | WASM SIMD |
| `ScalarToken` | | Always available |
All tokens compile on all platforms. `summon()` returns `None` on unsupported architectures. Detection is cached: ~1.3 ns after first call, 0 ns with `-Ctarget-cpu=haswell` (compiles away).
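For example, a tier probe that works on any architecture (the `summon()` spelling on each token type is an assumption based on the description above):

```rust
use archmage::prelude::*;

fn best_tier() -> &'static str {
    // All tokens compile everywhere; summon() is a cheap cached CPUID check.
    if let Some(_t) = Server64::summon() {         // AVX-512 (needs `avx512` feature)
        "v4"
    } else if let Some(_t) = Desktop64::summon() { // AVX2 + FMA
        "v3"
    } else {
        "scalar" // ScalarToken is always available
    }
}
```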
## The prelude

`use archmage::prelude::*` gives you:

- **Tokens:** `Desktop64`, `Arm64`, `Arm64V2Token`, `Arm64V3Token`, `ScalarToken`, etc.
- **Traits:** `SimdToken`, `IntoConcreteToken`, `HasX64V2`, etc.
- **Macros:** `#[arcane]`, `#[rite]`, `#[magetypes]`, `incant!`
- **Intrinsics:** `core::arch::*` for your platform
- **Memory ops:** `safe_unaligned_simd` functions (reference-based, no raw pointers)
## Testing SIMD dispatch paths

Every `incant!` dispatch and `if let Some(token) = summon()` branch creates a fallback path. You can test all of them on your native hardware — no cross-compilation needed.
### Exhaustive permutation testing

`for_each_token_permutation` runs your closure once for every unique combination of token tiers, from "all SIMD enabled" down to "scalar only". It handles the disable/re-enable lifecycle, mutex serialization, cascade logic, and deduplication.
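A sketch of a permutation test (the import path and closure signature are assumptions):

```rust
use archmage::for_each_token_permutation; // exact import path is an assumption

#[test]
fn dispatch_agrees_on_every_tier() {
    // The closure runs once per unique combination of enabled token tiers,
    // from "all SIMD enabled" down to "scalar only". Disable/re-enable,
    // mutex serialization, and deduplication are handled by the harness.
    for_each_token_permutation(|_state| {
        // Call your incant!-dispatched API here and assert it matches
        // a known-good scalar result.
    });
}
```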
On an AVX-512 machine, this runs 5–7 permutations (all enabled → AVX-512 only → AVX2+FMA → SSE4.2 → scalar). On a Haswell-era CPU without AVX-512, 3 permutations. Tokens the CPU doesn't have are skipped — they'd produce duplicate states.
Token disabling is process-wide, so run with `--test-threads=1`:
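That is:

```sh
cargo test -- --test-threads=1
```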
### `CompileTimePolicy` and `-Ctarget-cpu`

If you compiled with `-Ctarget-cpu=native`, the compiler bakes feature detection into the binary. `summon()` returns `Some` unconditionally, and tokens can't be disabled at runtime — the runtime check was compiled out.

The `CompileTimePolicy` enum controls what happens when `for_each_token_permutation` encounters these undisableable tokens:

- `Warn` — Exclude the token from permutations silently. Warnings are collected in the report.
- `WarnStderr` — Same, but also prints each warning to stderr with actionable fix instructions.
- `Fail` — Panic with the exact compiler flags needed to fix it.

For full coverage in CI, use the `testable_dispatch` feature. This makes `compiled_with()` return `None` even when features are baked in, so `summon()` uses runtime detection and tokens can be disabled:
```toml
# In your CI test configuration
[dependencies]
archmage = { version = "0.7", features = ["testable_dispatch"] }
```
### Enforcing full coverage via env var

Wire an environment variable to switch between `Warn` in local development and `Fail` in CI:
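One possible wiring (the import path is an assumption; the `CompileTimePolicy` variants are the documented ones):

```rust
use archmage::CompileTimePolicy; // exact import path is an assumption

// Fail in CI (full coverage enforced), Warn for local development.
fn policy() -> CompileTimePolicy {
    if std::env::var_os("ARCHMAGE_FULL_PERMUTATIONS").is_some() {
        CompileTimePolicy::Fail
    } else {
        CompileTimePolicy::Warn
    }
}
```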
Then in CI (with `testable_dispatch` enabled):

```sh
ARCHMAGE_FULL_PERMUTATIONS=1 cargo test -- --test-threads=1
```
If a token is still compile-time guaranteed (you forgot the feature or have stale `RUSTFLAGS`), `Fail` panics with the exact flags to fix it:

```text
x86-64-v3: compile-time guaranteed, excluded from permutations. To include it, either:
  1. Add `testable_dispatch` to archmage features in Cargo.toml
  2. Remove `-Ctarget-cpu` from RUSTFLAGS
  3. Compile with RUSTFLAGS="-Ctarget-feature=-avx2,-fma,-bmi1,-bmi2,-f16c,-lzcnt"
```
### Manual single-token disable

For targeted tests that only need to disable one token, disable just that token.
Disabling cascades downward: disabling V2 also disables V3/V4/Modern/Fp16; disabling NEON also disables Aes/Sha3/Crc.
### Disabling all SIMD at once
`dangerously_disable_tokens_except_wasm(true)` disables all SIMD tokens in one call:

```rust
use archmage::dangerously_disable_tokens_except_wasm;

// Force scalar-only execution for benchmarking
dangerously_disable_tokens_except_wasm(true).unwrap();
let scalar_result = my_simd_function();

// Re-enable afterwards (the `false` argument is an assumption)
dangerously_disable_tokens_except_wasm(false).unwrap();
```
This disables V2 on x86 (cascading to V3/V4/Modern/Fp16) and NEON on ARM (cascading to Aes/Sha3/Crc). V1 (`Sse2Token`) is not disabled — SSE2 is the x86_64 baseline and can't be meaningfully turned off at runtime. WASM is excluded because `simd128` is always a compile-time decision.
## Feature flags

| Feature | Default | Description |
|---|---|---|
| `std` | yes | Standard library |
| `macros` | yes | `#[arcane]`, `#[magetypes]`, `incant!` |
| `safe_unaligned_simd` | yes | Re-exports via prelude |
| `avx512` | no | AVX-512 tokens |
| `testable_dispatch` | no | Makes token disabling work with `-Ctarget-cpu=native` |
## License

MIT OR Apache-2.0