# archmage
[crates.io](https://crates.io/crates/archmage)
[docs.rs](https://docs.rs/archmage)
[CI](https://github.com/imazen/archmage/actions/workflows/ci.yml)
[codecov](https://codecov.io/gh/imazen/archmage)
[license](https://github.com/imazen/archmage#license)
> Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.
**archmage** provides capability tokens that prove CPU feature availability at runtime, making raw SIMD intrinsics safe to call via the `#[arcane]` macro.
## Quick Start
```rust
use archmage::{Desktop64, HasAvx2, SimdToken, arcane};
use archmage::mem::avx; // safe load/store (enabled by default)
use std::arch::x86_64::*;

#[arcane]
fn multiply_add(token: impl HasAvx2, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    // Safe memory operations - references, not raw pointers!
    let va = avx::_mm256_loadu_ps(token, a);
    let vb = avx::_mm256_loadu_ps(token, b);

    // Value-based intrinsics are SAFE inside #[arcane]!
    let result = _mm256_add_ps(va, vb);
    let result = _mm256_mul_ps(result, result);

    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, result);
    out
}

fn main() {
    // Desktop64 is the recommended starting point:
    // - AVX2 + FMA + BMI2
    // - Works on Intel Haswell (2013+) and AMD Zen 1 (2017+)
    // - Covers ~95% of desktop/server CPUs in use today
    if let Some(token) = Desktop64::summon() {
        let result = multiply_add(token, &[1.0; 8], &[2.0; 8]);
        println!("{:?}", result);
    }
}
```
## How It Works
### The Problem
Raw SIMD intrinsics have two safety concerns:
1. **Feature availability**: Calling `_mm256_add_ps` on a CPU without AVX is undefined behavior
2. **Memory safety**: `_mm256_loadu_ps(ptr)` dereferences a raw pointer
Rust 1.87+ made value-based intrinsics safe inside `#[target_feature]` functions, but calling those functions is still `unsafe` because the compiler can't verify that the CPU supports the features.
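For comparison, the conventional pattern looks like this (a minimal sketch using only `std`; nothing but a comment ties the feature check to the `unsafe` call):
```rust
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    // Raw-pointer loads remain unsafe even with the feature enabled.
    let va = unsafe { _mm256_loadu_ps(a.as_ptr()) };
    let vb = unsafe { _mm256_loadu_ps(b.as_ptr()) };
    let mut out = [0.0f32; 8];
    unsafe { _mm256_storeu_ps(out.as_mut_ptr(), _mm256_add_ps(va, vb)) };
    out
}

fn add(a: &[f32; 8], b: &[f32; 8]) -> Option<[f32; 8]> {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: AVX2 was just verified at runtime. Only this comment
        // connects the check to the call - the compiler can't help.
        Some(unsafe { add_avx2(a, b) })
    } else {
        None
    }
}
```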
### The Solution: Tokens + `#[arcane]`
archmage solves this with two components:
**1. Capability Tokens** - Zero-sized proof types created after runtime CPU detection:
```rust
use archmage::{Desktop64, SimdToken};
// summon() checks CPUID and returns Some only if features are available
// (check is elided if compiled with -C target-cpu=native or similar)
if let Some(token) = Desktop64::summon() {
    // Token exists = CPU definitely has AVX2 + FMA + BMI2
}
```
**2. The `#[arcane]` Macro** - Transforms your function to enable `#[target_feature]`:
```rust
#[arcane]
fn my_kernel(token: impl HasAvx2, data: &[f32; 8]) -> [f32; 8] {
    // Intrinsics are safe here!
    let v = _mm256_setzero_ps();
    // ...
}
```
The macro generates:
```rust
fn my_kernel(token: impl HasAvx2, data: &[f32; 8]) -> [f32; 8] {
    #[target_feature(enable = "avx2")]
    unsafe fn inner(data: &[f32; 8]) -> [f32; 8] {
        let v = _mm256_setzero_ps(); // Safe inside #[target_feature]!
        // ...
    }
    // SAFETY: The token parameter proves the caller verified CPU support
    unsafe { inner(data) }
}
```
**Why is this safe?**
1. `inner()` has `#[target_feature(enable = "avx2")]`, so Rust allows intrinsics without `unsafe`
2. Calling `inner()` is unsafe, but we know it's valid because:
- The function requires a token parameter
- Tokens can only be created via `summon()` which checks CPU features
- Therefore, if you have a token, the CPU supports the features
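To make the chain concrete, here is a sketch (it assumes, per the crate's design, that token types expose no public constructor):
```rust
use archmage::{Desktop64, SimdToken};

fn example() {
    // There is no `Desktop64::new()` or public field to forge a token;
    // summon() is the only source, and it runs the CPUID check first.
    if let Some(token) = Desktop64::summon() {
        // Holding `token` here is proof the check passed on this machine.
        let _ = token;
    }
}
```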
### Generic Token Bounds
Functions accept any token that provides the required capabilities:
```rust
use archmage::{HasAvx2, HasFma, arcane};
use archmage::mem::avx;
use std::arch::x86_64::*;
// Accept any token with AVX2 (Avx2Token, Desktop64, Server64, etc.)
#[arcane]
fn double(token: impl HasAvx2, data: &[f32; 8]) -> [f32; 8] {
    let v = avx::_mm256_loadu_ps(token, data);
    let doubled = _mm256_add_ps(v, v);
    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, doubled);
    out
}

// Require multiple features with inline bounds
#[arcane]
fn fma_kernel<T: HasAvx2 + HasFma>(token: T, a: &[f32; 8], b: &[f32; 8], c: &[f32; 8]) -> [f32; 8] {
    let va = avx::_mm256_loadu_ps(token, a);
    let vb = avx::_mm256_loadu_ps(token, b);
    let vc = avx::_mm256_loadu_ps(token, c);
    let result = _mm256_fmadd_ps(va, vb, vc); // a * b + c
    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, result);
    out
}

// Where clause syntax
#[arcane]
fn square<T>(token: T, data: &mut [f32; 8])
where
    T: HasAvx2,
{
    let v = avx::_mm256_loadu_ps(token, data);
    let squared = _mm256_mul_ps(v, v);
    avx::_mm256_storeu_ps(token, data, squared);
}
```
The trait hierarchy means broader tokens satisfy narrower bounds (see the sketch after this list):
- `Desktop64` implements `HasAvx2`, `HasFma`, `HasSse42`, etc.
- `Server64` implements everything `Desktop64` does, plus `HasAvx512f`, etc.
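A minimal sketch of that substitutability (`needs_sse42` is a placeholder function, not an archmage API):
```rust
use archmage::{Desktop64, HasSse42, SimdToken};

// Any token implementing HasSse42 is accepted...
fn needs_sse42(_token: impl HasSse42) { /* ... */ }

fn main() {
    // ...including the broader Desktop64.
    if let Some(token) = Desktop64::summon() {
        needs_sse42(token);
    }
}
```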
## Choosing a Token
**Start with `Desktop64`** - it's the sweet spot for modern x86-64:
| Token | Features | Hardware coverage |
|-------|----------|-------------------|
| `Desktop64` | AVX2 + FMA + BMI2 | Intel Haswell 2013+, AMD Zen 1 2017+ (~95% of x86-64) |
| `Server64` | + AVX-512 | Intel Skylake-X 2017+, AMD Zen 4 2022+ |
| `X64V2Token` | SSE4.2 + POPCNT | Intel Nehalem 2008+, AMD Bulldozer 2011+ |
**For specific features:**
| Token | Use case |
|-------|----------|
| `Avx2Token` | Need AVX2 but not FMA |
| `Avx2FmaToken` | AVX2 + FMA (most floating-point SIMD) |
| `FmaToken` | FMA only |
| `Sse2Token` | Baseline x86-64 (always available) |
**ARM tokens:**
| Token | Feature | Hardware |
|-------|---------|----------|
| `NeonToken` | NEON | All AArch64 (baseline, including Apple M-series) |
| `SveToken` | SVE | Graviton 3, A64FX |
| `Sve2Token` | SVE2 | ARMv9: Graviton 4, Cortex-X2+ |
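Whichever tier you target, dispatch follows the same shape: try the widest token first and step down. A sketch using the x86 tiers above (the `kernel_*` functions are placeholders for your own `#[arcane]` kernels):
```rust
use archmage::{Desktop64, Server64, Sse2Token, SimdToken};

fn kernel_avx512(_t: Server64, _data: &mut [f32]) { /* ... */ }
fn kernel_avx2(_t: Desktop64, _data: &mut [f32]) { /* ... */ }
fn kernel_sse2(_t: Sse2Token, _data: &mut [f32]) { /* ... */ }

fn dispatch(data: &mut [f32]) {
    if let Some(token) = Server64::summon() {
        kernel_avx512(token, data); // AVX-512 tier
    } else if let Some(token) = Desktop64::summon() {
        kernel_avx2(token, data); // AVX2 + FMA tier
    } else if let Some(token) = Sse2Token::summon() {
        kernel_sse2(token, data); // baseline x86-64 tier
    }
}
```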
## Cross-Architecture Tokens
All token types are available on all architectures. This makes cross-platform code easier to write without `#[cfg]` guards everywhere:
```rust
use archmage::{Desktop64, NeonToken, SimdToken};
// This compiles on ARM, x86, WASM - no #[cfg] needed!
fn process_data(data: &mut [f32]) {
    if let Some(token) = Desktop64::summon() {
        // AVX2 path (only succeeds on x86 with AVX2)
        process_x86(token, data);
    } else if let Some(token) = NeonToken::summon() {
        // NEON path (only succeeds on AArch64)
        process_arm(token, data);
    } else {
        // Scalar fallback
        process_scalar(data);
    }
}
```
- `summon()` returns `None` on unsupported architectures
- Rust's type system ensures intrinsic methods don't exist on the wrong arch
- You get compile errors if you try to use x86 intrinsics in ARM code
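For completeness, the helpers above could be as simple as this (a sketch; `process_x86`, `process_arm`, and `process_scalar` are your own functions, not archmage APIs):
```rust
use archmage::{Desktop64, NeonToken};

fn process_x86(_token: Desktop64, data: &mut [f32]) {
    // In real code: call an #[arcane] AVX2 kernel with the token.
    for x in data.iter_mut() { *x *= 2.0; }
}

fn process_arm(_token: NeonToken, data: &mut [f32]) {
    // In real code: call an #[arcane] NEON kernel with the token.
    for x in data.iter_mut() { *x *= 2.0; }
}

fn process_scalar(data: &mut [f32]) {
    for x in data.iter_mut() { *x *= 2.0; }
}
```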
## Safe Memory Operations (`mem` module)
The `mem` module (enabled by default) provides safe load/store using references instead of raw pointers:
```rust
use archmage::{Desktop64, SimdToken};
use archmage::mem::avx;
if let Some(token) = Desktop64::summon() {
    let data = [1.0f32; 8];
    let v = avx::_mm256_loadu_ps(token, &data); // Safe! Reference, not pointer
    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, v); // Safe!
}
```
**Available submodules:**
| Module | Functions | Token bound |
|--------|-----------|-------------|
| `mem::sse` | `_mm_loadu_ps`, `_mm_storeu_ps`, etc. | `impl HasSse` |
| `mem::sse2` | `_mm_loadu_pd`, `_mm_loadu_si128`, etc. | `impl HasSse2` |
| `mem::avx` | `_mm256_loadu_ps`, `_mm256_storeu_ps`, etc. | `impl HasAvx` |
| `mem::avx2` | `_mm256_loadu_si256`, etc. | `impl HasAvx2` |
| `mem::avx512f` | `_mm512_loadu_ps`, etc. | `impl HasAvx512f` |
| `mem::neon` | `vld1q_f32`, `vst1q_f32`, etc. | `impl HasNeon` |
The wrappers accept any compatible token (e.g., `Desktop64` works with `mem::avx` because it implements `HasAvx`).
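The same shape works on AArch64; a sketch assuming the NEON wrappers mirror the x86 ones (token first, then a reference):
```rust
use archmage::{NeonToken, SimdToken};
use archmage::mem::neon;

if let Some(token) = NeonToken::summon() {
    let data = [1.0f32; 4];
    let v = neon::vld1q_f32(token, &data); // reference in, no unsafe
    let mut out = [0.0f32; 4];
    neon::vst1q_f32(token, &mut out, v);
}
```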
## When to Use archmage
archmage is for when you need **specific instructions** that autovectorization won't produce:
- Complex shuffles and permutes
- Exact FMA sequences for numerical precision
- DCT butterflies and signal processing
- Gather/scatter operations
- Bit manipulation (BMI1/BMI2)
For portable SIMD without manual intrinsics, use the `wide` crate instead.
| Crate | Use when |
|-------|----------|
| **wide** | Portable code, let the compiler choose instructions |
| **archmage** | Need specific instructions, complex algorithms |
## Feature Flags
```toml
[dependencies]
archmage = "0.1"
```
| Flag | Description |
|------|-------------|
| `std` (default) | Enable std library support |
| `macros` (default) | Enable `#[arcane]` macro (alias: `#[simd_fn]`) |
| `safe_unaligned_simd` (default) | Safe load/store via references (exposed as `mem` module) |
**Unstable features** (API may change):
| Flag | Description |
|------|-------------|
| `__composite` | Higher-level ops (transpose, dot product) |
| `__wide` | Integration with the `wide` crate |
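To opt out of a default feature, disable defaults and re-enable what you need (a sketch using the flag names above):
```toml
[dependencies]
archmage = { version = "0.1", default-features = false, features = ["macros"] }
```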
### Testing Scalar Fallbacks
Set the `ARCHMAGE_DISABLE` environment variable to force scalar code paths:
```bash
ARCHMAGE_DISABLE=1 cargo test
ARCHMAGE_DISABLE=1 cargo run --release
```
```rust
// With ARCHMAGE_DISABLE set, this always takes the fallback path
if let Some(token) = Desktop64::summon() {
    simd_path(token, &mut data);
} else {
    scalar_fallback(&mut data); // Always runs with ARCHMAGE_DISABLE
}
```
## Methods with Self Receivers
Methods with `self`, `&self`, `&mut self` receivers are supported via the `_self = Type` argument.
Use `_self` in the function body instead of `self`:
```rust
use archmage::{HasAvx2, arcane};
use std::arch::x86_64::*;

trait SimdOps {
    fn double(&self, token: impl HasAvx2) -> Self;
    fn scale(&mut self, token: impl HasAvx2, factor: f32);
}

impl SimdOps for [f32; 8] {
    #[arcane(_self = [f32; 8])]
    fn double(&self, _token: impl HasAvx2) -> Self {
        // Use _self instead of self in the body
        let v = unsafe { _mm256_loadu_ps(_self.as_ptr()) };
        let doubled = _mm256_add_ps(v, v);
        let mut out = [0.0f32; 8];
        unsafe { _mm256_storeu_ps(out.as_mut_ptr(), doubled) };
        out
    }

    #[arcane(_self = [f32; 8])]
    fn scale(&mut self, _token: impl HasAvx2, factor: f32) {
        let v = unsafe { _mm256_loadu_ps(_self.as_ptr()) };
        let scale = _mm256_set1_ps(factor);
        let scaled = _mm256_mul_ps(v, scale);
        unsafe { _mm256_storeu_ps(_self.as_mut_ptr(), scaled) };
    }
}
```
**Why `_self`?** The macro generates an inner function where `self` becomes a regular
parameter named `_self`. Using `_self` in your code reminds you that you're not using
the normal `self` keyword.
All receiver types are supported: `self` (move), `&self` (shared reference), and `&mut self` (mutable reference).
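The expansion mirrors the earlier generated-code example; a hand-written sketch (not the macro's literal output) for `double`:
```rust
use archmage::HasAvx2;
use std::arch::x86_64::*;

// Hand-written equivalent of what #[arcane(_self = [f32; 8])]
// conceptually generates for `double`:
fn double_expanded(this: &[f32; 8], _token: impl HasAvx2) -> [f32; 8] {
    #[target_feature(enable = "avx2")]
    unsafe fn inner(_self: &[f32; 8]) -> [f32; 8] {
        // Body as written, with `self` renamed to `_self`.
        let v = unsafe { _mm256_loadu_ps(_self.as_ptr()) };
        let doubled = _mm256_add_ps(v, v);
        let mut out = [0.0f32; 8];
        unsafe { _mm256_storeu_ps(out.as_mut_ptr(), doubled) };
        out
    }
    // SAFETY: holding `_token` proves AVX2 was detected at runtime.
    unsafe { inner(this) }
}
```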
## License
MIT OR Apache-2.0
## AI-Generated Code Notice
Developed with Claude (Anthropic). Not all code has been manually reviewed; review critical paths before production use.