# archmage
[![crates.io](https://img.shields.io/crates/v/archmage.svg)](https://crates.io/crates/archmage)
[![docs.rs](https://docs.rs/archmage/badge.svg)](https://docs.rs/archmage)
[![CI](https://github.com/imazen/archmage/actions/workflows/ci.yml/badge.svg)](https://github.com/imazen/archmage/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/imazen/archmage/graph/badge.svg)](https://codecov.io/gh/imazen/archmage)
[![license](https://img.shields.io/badge/license-MIT%20OR%20Apache--2.0-blue.svg)](https://github.com/imazen/archmage#license)
> Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.
**archmage** provides capability tokens that prove CPU feature availability at runtime, making raw SIMD intrinsics safe to call via the `#[arcane]` macro.
## Quick Start
```rust
use archmage::{Desktop64, HasAvx2, SimdToken, arcane};
use archmage::mem::avx; // safe load/store (enabled by default)
use std::arch::x86_64::*;
#[arcane]
fn multiply_add(token: impl HasAvx2, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    // Safe memory operations: references, not raw pointers!
    let va = avx::_mm256_loadu_ps(token, a);
    let vb = avx::_mm256_loadu_ps(token, b);

    // Value-based intrinsics are SAFE inside #[arcane]!
    let result = _mm256_add_ps(va, vb);
    let result = _mm256_mul_ps(result, result);

    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, result);
    out
}

fn main() {
    // Desktop64 is the recommended starting point:
    // - AVX2 + FMA + BMI2
    // - Works on Intel Haswell (2013+) and AMD Zen 1 (2017+)
    // - Covers ~95% of desktop/server CPUs in use today
    if let Some(token) = Desktop64::summon() {
        let result = multiply_add(token, &[1.0; 8], &[2.0; 8]);
        println!("{:?}", result);
    }
}
```
## How It Works
### The Problem
Raw SIMD intrinsics have two safety concerns:
1. **Feature availability**: Calling `_mm256_add_ps` on a CPU without AVX is undefined behavior
2. **Memory safety**: `_mm256_loadu_ps(ptr)` dereferences a raw pointer
Rust 1.85+ made value-based intrinsics safe inside `#[target_feature]` functions, but calling those functions is still `unsafe` because the compiler can't verify the CPU supports the features.
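For context, here is a minimal sketch of the status quo without tokens, using the standard `is_x86_feature_detected!` macro: the kernel itself compiles fine, but every call site needs its own runtime check and an `unsafe` block the compiler cannot audit.

```rust
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn add8(a: &[f32; 8], b: &[f32; 8], out: &mut [f32; 8]) {
    // SAFETY: pointers come from valid references; loads/stores are unaligned.
    unsafe {
        let v = _mm256_add_ps(_mm256_loadu_ps(a.as_ptr()), _mm256_loadu_ps(b.as_ptr()));
        _mm256_storeu_ps(out.as_mut_ptr(), v);
    }
}

fn add(a: &[f32; 8], b: &[f32; 8]) -> Option<[f32; 8]> {
    if is_x86_feature_detected!("avx2") {
        let mut out = [0.0f32; 8];
        // SAFETY: AVX2 was just checked -- but nothing forces other
        // callers of add8 to perform the same check.
        unsafe { add8(a, b, &mut out) };
        Some(out)
    } else {
        None
    }
}
```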
### The Solution: Tokens + `#[arcane]`
archmage solves this with two components:
**1. Capability Tokens** - Zero-sized proof types created after runtime CPU detection:
```rust
use archmage::{Desktop64, SimdToken};
// summon() checks CPUID and returns Some only if features are available
// (check is elided if compiled with -C target-cpu=native or similar)
if let Some(token) = Desktop64::summon() {
    // Token exists = the CPU definitely has AVX2 + FMA + BMI2
}
```
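Because tokens are zero-sized, holding one is purely compile-time proof; a quick sketch of what that means in practice:

```rust
use archmage::Desktop64;

fn main() {
    // A token carries no data, so threading one through call layers
    // costs nothing at runtime.
    assert_eq!(std::mem::size_of::<Desktop64>(), 0);
}
```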
**2. The `#[arcane]` Macro** - Transforms your function to enable `#[target_feature]`:
```rust
#[arcane]
fn my_kernel(token: impl HasAvx2, data: &[f32; 8]) -> [f32; 8] {
    // Intrinsics are safe here!
    let v = _mm256_setzero_ps();
    // ...
}
```
The macro generates:
```rust
fn my_kernel(token: impl HasAvx2, data: &[f32; 8]) -> [f32; 8] {
    #[target_feature(enable = "avx2")]
    unsafe fn inner(data: &[f32; 8]) -> [f32; 8] {
        let v = _mm256_setzero_ps(); // Safe inside #[target_feature]!
        // ...
    }
    // SAFETY: The token parameter proves the caller verified CPU support
    unsafe { inner(data) }
}
```
**Why is this safe?**
1. `inner()` has `#[target_feature(enable = "avx2")]`, so Rust allows intrinsics without `unsafe`
2. Calling `inner()` is unsafe, but we know it's valid because:
- The function requires a token parameter
- Tokens can only be created via `summon()` which checks CPU features
- Therefore, if you have a token, the CPU supports the features
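Put together, the only way into a kernel is through the runtime check. A typical entry point (reusing `multiply_add` from the Quick Start) looks like this:

```rust
fn entry(a: &[f32; 8], b: &[f32; 8]) -> Option<[f32; 8]> {
    // summon() is the sole safe constructor, so every call path
    // into the kernel has necessarily passed the CPUID check.
    Desktop64::summon().map(|t| multiply_add(t, a, b))
}
```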
### Generic Token Bounds
Functions accept any token that provides the required capabilities:
```rust
use archmage::{HasAvx2, HasFma, arcane};
use archmage::mem::avx;
use std::arch::x86_64::*;
// Accept any token with AVX2 (Avx2Token, Desktop64, Server64, etc.)
#[arcane]
fn double(token: impl HasAvx2, data: &[f32; 8]) -> [f32; 8] {
    let v = avx::_mm256_loadu_ps(token, data);
    let doubled = _mm256_add_ps(v, v);
    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, doubled);
    out
}

// Require multiple features with inline bounds
#[arcane]
fn fma_kernel<T: HasAvx2 + HasFma>(
    token: T,
    a: &[f32; 8],
    b: &[f32; 8],
    c: &[f32; 8],
) -> [f32; 8] {
    let va = avx::_mm256_loadu_ps(token, a);
    let vb = avx::_mm256_loadu_ps(token, b);
    let vc = avx::_mm256_loadu_ps(token, c);
    let result = _mm256_fmadd_ps(va, vb, vc); // a * b + c
    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, result);
    out
}

// Where-clause syntax
#[arcane]
fn square<T>(token: T, data: &mut [f32; 8])
where
    T: HasAvx2,
{
    let v = avx::_mm256_loadu_ps(token, data);
    let squared = _mm256_mul_ps(v, v);
    avx::_mm256_storeu_ps(token, data, squared);
}
```
The trait hierarchy means broader tokens satisfy narrower bounds:
- `Desktop64` implements `HasAvx2`, `HasFma`, `HasSse42`, etc.
- `Server64` implements everything `Desktop64` does, plus `HasAvx512f`, etc.
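For example, the `double` kernel above accepts either a broad or a narrow token unchanged; this sketch assumes `Avx2Token` (listed in the tables below) implements `SimdToken::summon()` like the other tokens:

```rust
use archmage::{Avx2Token, Desktop64, SimdToken};

fn run() {
    // A broad token...
    if let Some(token) = Desktop64::summon() {
        let _ = double(token, &[1.0; 8]);
    }
    // ...and a narrow one both satisfy `impl HasAvx2`.
    if let Some(token) = Avx2Token::summon() {
        let _ = double(token, &[1.0; 8]);
    }
}
```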
## Choosing a Token
**Start with `Desktop64`** - it's the sweet spot for modern x86-64:
| Token | Features | Hardware |
|---|---|---|
| `Desktop64` | AVX2 + FMA + BMI2 | Intel Haswell (2013+), AMD Zen 1 (2017+); ~95% of x86-64 |
| `Server64` | `Desktop64` + AVX-512 | Intel Skylake-X (2017+), AMD Zen 4 (2022+) |
| `X64V2Token` | SSE4.2 + POPCNT | Intel Nehalem (2008+), AMD Bulldozer (2011+) |
**For specific features:**
| Token | Use when |
|---|---|
| `Avx2Token` | You need AVX2 but not FMA |
| `Avx2FmaToken` | AVX2 + FMA (most floating-point SIMD) |
| `FmaToken` | FMA only |
| `Sse2Token` | Baseline x86-64 (always available) |
**ARM tokens:**
| Token | Features | Hardware |
|---|---|---|
| `NeonToken` | NEON | All AArch64 (baseline, including Apple M-series) |
| `SveToken` | SVE | Graviton 3, A64FX |
| `Sve2Token` | SVE2 | ARMv9: Graviton 4, Cortex-X2+ |
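Whichever tier you target, a common pattern is a dispatch ladder that tries the widest token first and falls back to scalar code. `sum_avx512` and `sum_avx2` below are hypothetical kernels with stub bodies, not archmage APIs:

```rust
use archmage::{Desktop64, HasAvx2, Server64, SimdToken};

// Hypothetical kernels standing in for real #[arcane] functions.
fn sum_avx512(_t: Server64, data: &[f32]) -> f32 { data.iter().sum() }
fn sum_avx2(_t: impl HasAvx2, data: &[f32]) -> f32 { data.iter().sum() }

fn sum(data: &[f32]) -> f32 {
    // Try the widest token first, then fall back.
    if let Some(t) = Server64::summon() {
        return sum_avx512(t, data);
    }
    if let Some(t) = Desktop64::summon() {
        return sum_avx2(t, data);
    }
    data.iter().sum() // scalar fallback
}
```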
## Safe Memory Operations
With the `safe_unaligned_simd` feature, load/store uses references instead of raw pointers:
```rust
use archmage::{Desktop64, SimdToken};
use archmage::mem::avx;
if let Some(token) = Desktop64::summon() {
    let data = [1.0f32; 8];
    let v = avx::_mm256_loadu_ps(token, &data); // Safe! Reference, not pointer

    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, v); // Safe!
}
```
The `mem` module wrappers accept `impl HasAvx`, `impl HasSse2`, etc., so any compatible token works.
## When to Use archmage
archmage is for when you need **specific instructions** that autovectorization won't produce:
- Complex shuffles and permutes
- Exact FMA sequences for numerical precision
- DCT butterflies and signal processing
- Gather/scatter operations
- Bit manipulation (BMI1/BMI2)
For portable SIMD without manual intrinsics, use the `wide` crate instead.
| Crate | Use when |
|---|---|
| **wide** | You want portable code; let the compiler choose instructions |
| **archmage** | You need specific instructions or complex algorithms |
## Feature Flags
```toml
[dependencies]
archmage = "0.1"
```
| Feature | Description |
|---|---|
| `std` (default) | Standard library support |
| `macros` (default) | The `#[arcane]` macro (alias: `#[simd_fn]`) |
| `safe_unaligned_simd` (default) | Safe load/store via references (exposed as the `mem` module) |
**Unstable features** (API may change):
| Feature | Description |
|---|---|
| `__composite` | Higher-level ops (transpose, dot product) |
| `__wide` | Integration with the `wide` crate |
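For a leaner build, defaults can be disabled and re-enabled individually; this sketch assumes the flags above are the literal Cargo feature names:

```toml
[dependencies]
archmage = { version = "0.1", default-features = false, features = ["macros"] }
```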
## Limitations
**Self receivers not supported in `#[arcane]`:**
```rust
// This won't work: the generated inner function can't take `self`
impl MyStruct {
    #[arcane]
    fn process(&self, token: impl HasAvx2) { /* ... */ }
}

// Instead, take the value as a regular parameter in a free function:
#[arcane]
fn process(state: &MyStruct, token: impl HasAvx2) { /* ... */ }
```
## License
MIT OR Apache-2.0
## AI-Generated Code Notice
Developed with Claude (Anthropic). Not all code has been manually reviewed; review critical paths before production use.