# archmage
Safely invoke your intrinsic power, using the tokens granted to you by the CPU.
Zero overhead. Archmage generates identical assembly to hand-written unsafe code. The safety abstractions exist only at compile time—at runtime, you get raw SIMD instructions. Calling an #[arcane] function costs exactly the same as calling a bare #[target_feature] function directly.
```toml
[dependencies]
archmage = "0.6"
magetypes = "0.6"
```
## Raw intrinsics with #[arcane]
```rust
use archmage::prelude::*;
```
summon() checks CPUID. #[arcane] enables #[target_feature], making intrinsics safe (Rust 1.85+). The prelude re-exports safe_unaligned_simd functions directly — _mm256_loadu_ps takes &[f32; 8], not a raw pointer. Compile with -C target-cpu=haswell to elide the runtime check.
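For example, a minimal sketch of that flow, assuming `Desktop64::summon()` returns an `Option` of the token and that an `#[arcane]` function takes the token as its first argument (x86-64 only; names are illustrative):

```rust
use archmage::prelude::*;

#[arcane]
fn add8(_token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    // Reference-based loads/stores re-exported from safe_unaligned_simd
    // (the &mut form of the store is assumed to mirror the load).
    let sum = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(&mut out, sum);
    out
}

fn main() {
    // summon() checks CPUID once; None means the CPU lacks AVX2/FMA.
    if let Some(token) = Desktop64::summon() {
        println!("{:?}", add8(token, &[1.0; 8], &[2.0; 8]));
    }
}
```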
## Inner helpers with #[rite]
#[rite] should be your default. Use #[arcane] only at entry points.
```rust
use archmage::prelude::*;

// Entry point: use #[arcane]
// Inner helper: use #[rite] (inlines into #[arcane] — features match)
// (see the full sketch below)
```
#[rite] adds #[target_feature] + #[inline] without a wrapper function. Since Rust 1.85, calling #[target_feature] functions from matching contexts is safe; no unsafe is needed between #[arcane] and #[rite] functions.
Performance rule: Never call #[arcane] from #[arcane]. Use #[rite] for any function called exclusively from SIMD code.
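Putting the two together, a sketch of the intended shape (the token-as-argument convention and the names are assumed, as above):

```rust
use archmage::prelude::*;

// Entry point: #[arcane] is callable from ordinary code.
#[arcane]
fn double_all(token: Desktop64, chunks: &mut [[f32; 8]]) {
    for chunk in chunks.iter_mut() {
        // Safe call, no `unsafe`: the #[rite] helper has matching features,
        // so LLVM inlines it and there is no target-feature boundary.
        double8(token, chunk);
    }
}

// Inner helper: #[rite] adds #[target_feature] + #[inline], no wrapper.
#[rite]
fn double8(_token: Desktop64, chunk: &mut [f32; 8]) {
    let v = _mm256_loadu_ps(chunk);
    _mm256_storeu_ps(chunk, _mm256_add_ps(v, v));
}
```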
## Why this matters
Processing 1000 8-float vector additions (full benchmark details):
| Pattern | Time | Why |
|---|---|---|
| `#[rite]` in `#[arcane]` | 547 ns | Features match — LLVM inlines |
| `#[arcane]` per iteration | 2209 ns (4x) | Target-feature boundary per call |
| Bare `#[target_feature]` (no archmage) | 2222 ns (4x) | Same boundary — archmage adds nothing |
The 4x penalty comes from LLVM's #[target_feature] optimization boundary, not from archmage. Bare #[target_feature] has the same cost. With real workloads (DCT-8), the boundary costs up to 6.2x. Use #[rite] for helpers called from SIMD code — it inlines into callers with matching features, eliminating the boundary.
## SIMD types with magetypes
```rust
use archmage::prelude::*;
use magetypes::f32x8;
```
f32x8 wraps __m256 with token-gated construction and natural operators.
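A sketch of how that reads in practice; the `from_array`/`to_array` constructor names below are hypothetical stand-ins for the token-gated construction API, and only the operator arithmetic is taken from the line above:

```rust
use archmage::prelude::*;
use magetypes::f32x8;

#[arcane]
fn mul_add(token: Desktop64, a: [f32; 8], b: [f32; 8], c: [f32; 8]) -> [f32; 8] {
    // Hypothetical token-gated constructors; check the magetypes docs for the real names.
    let (a, b, c) = (
        f32x8::from_array(token, a),
        f32x8::from_array(token, b),
        f32x8::from_array(token, c),
    );
    // Natural operators on the wrapper type.
    (a * b + c).to_array()
}
```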
## Runtime dispatch with incant!
Write platform-specific variants with concrete types, then dispatch at runtime:
```rust
use archmage::incant;
use magetypes::f32x8;

const LANES: usize = 8;

// Sketch: the variant bodies and the incant! call below are illustrative.

/// AVX2 path — processes 8 floats at a time.
fn sum_v3(data: &[f32]) -> f32 {
    // The real body would walk LANES-wide chunks with f32x8.
    data.iter().sum()
}

/// Scalar fallback.
fn sum_scalar(data: &[f32]) -> f32 {
    data.iter().sum()
}

/// Public API — dispatches to the best available at runtime.
pub fn sum(data: &[f32]) -> f32 {
    // Illustrative invocation form; see the incant! docs for the exact syntax.
    incant!(sum(data))
}
```
incant! looks for _v3, _v4, _neon, _wasm128, and _scalar suffixed functions, and dispatches to the best one the CPU supports. Each variant uses concrete SIMD types for its platform; the scalar fallback uses plain math.
## #[magetypes] for simple cases
If your function body doesn't use SIMD types (only Token), #[magetypes] can generate the variants for you by replacing Token with the concrete token type for each platform:
```rust
use archmage::magetypes;
```
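For instance, a sketch (the function is illustrative; per the description above, `Token` in the signature is what the attribute replaces with each platform's concrete token type):

```rust
use archmage::prelude::*;

#[magetypes]
fn dot(_token: Token, a: &[f32], b: &[f32]) -> f32 {
    // No platform-specific SIMD types in the body, so the same source can be
    // stamped out once per token type, letting the compiler optimize each
    // variant under its own target features.
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```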
For functions that use platform-specific SIMD types (f32x8, f32x4, etc.), write the variants manually and use incant! as shown above.
## Tokens
| Token | Alias | Features |
|---|---|---|
| `X64V2Token` | | SSE4.2, POPCNT |
| `X64V3Token` | `Desktop64` | AVX2, FMA, BMI2 |
| `X64V4Token` | `Server64` | AVX-512 (requires `avx512` feature) |
| `NeonToken` | `Arm64` | NEON |
| `Wasm128Token` | | WASM SIMD |
| `ScalarToken` | | Always available |
All tokens compile on all platforms. summon() returns None on unsupported architectures. Detection is cached: ~1.3 ns after first call, 0 ns with -Ctarget-cpu=haswell (compiles away).
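For example (a small sketch, assuming `summon()` is an associated function returning an `Option` of the token):

```rust
use archmage::prelude::*;

fn main() {
    // Compiles on every platform; on CPUs without AVX2/FMA this is simply None.
    match Desktop64::summon() {
        Some(_token) => println!("AVX2 path available"),
        None => println!("using the scalar fallback"),
    }
}
```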
## The prelude
`use archmage::prelude::*` gives you:

- Tokens: `Desktop64`, `Arm64`, `ScalarToken`, etc.
- Traits: `SimdToken`, `IntoConcreteToken`, `HasX64V2`, etc.
- Macros: `#[arcane]`, `#[rite]`, `#[magetypes]`, `incant!`
- Intrinsics: `core::arch::*` for your platform
- Memory ops: `safe_unaligned_simd` functions (reference-based, no raw pointers)
## Feature flags
| Feature | Default | Description |
|---|---|---|
| `std` | yes | Standard library |
| `macros` | yes | `#[arcane]`, `#[magetypes]`, `incant!` |
| `safe_unaligned_simd` | yes | Re-exports via prelude |
| `avx512` | no | AVX-512 tokens |
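For example, to opt into the AVX-512 tokens (keeping the default features enabled):

```toml
[dependencies]
archmage = { version = "0.6", features = ["avx512"] }
```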
## License
MIT OR Apache-2.0