archmage 0.1.0 - Docs.rs

# archmage

[![Crates.io](https://img.shields.io/crates/v/archmage.svg)](https://crates.io/crates/archmage)
[![Documentation](https://docs.rs/archmage/badge.svg)](https://docs.rs/archmage)
[![CI](https://github.com/imazen/archmage/actions/workflows/ci.yml/badge.svg)](https://github.com/imazen/archmage/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/imazen/archmage/branch/main/graph/badge.svg)](https://codecov.io/gh/imazen/archmage)
[![License](https://img.shields.io/crates/l/archmage.svg)](https://github.com/imazen/archmage#license)

> Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.

**archmage** provides capability tokens that prove CPU feature availability at runtime, making raw SIMD intrinsics safe to call via the `#[arcane]` macro.

## Quick Start

```rust
use archmage::{Desktop64, HasAvx2, SimdToken, arcane};
use archmage::mem::avx;  // safe load/store (enabled by default)
use std::arch::x86_64::*;

#[arcane]
fn multiply_add(token: impl HasAvx2, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    // Safe memory operations - references, not raw pointers!
    let va = avx::_mm256_loadu_ps(token, a);
    let vb = avx::_mm256_loadu_ps(token, b);

    // Value-based intrinsics are SAFE inside #[arcane]!
    let result = _mm256_add_ps(va, vb);
    let result = _mm256_mul_ps(result, result);

    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, result);
    out
}

fn main() {
    // Desktop64 is the recommended starting point:
    // - AVX2 + FMA + BMI2
    // - Works on Intel Haswell (2013+) and AMD Zen 1 (2017+)
    // - Covers ~95% of desktop/server CPUs in use today
    if let Some(token) = Desktop64::summon() {
        let result = multiply_add(token, &[1.0; 8], &[2.0; 8]);
        println!("{:?}", result);
    }
}
```

## How It Works

### The Problem

Raw SIMD intrinsics have two safety concerns:

1. **Feature availability**: Calling `_mm256_add_ps` on a CPU without AVX is undefined behavior
2. **Memory safety**: `_mm256_loadu_ps(ptr)` dereferences a raw pointer

Rust 1.85+ made value-based intrinsics safe inside `#[target_feature]` functions, but calling those functions is still `unsafe` because the compiler can't verify the CPU supports the features.

### The Solution: Tokens + `#[arcane]`

archmage solves this with two components:

**1. Capability Tokens** - Zero-sized proof types created after runtime CPU detection:

```rust
use archmage::{Desktop64, SimdToken};

// summon() checks CPUID and returns Some only if features are available
// (check is elided if compiled with -C target-cpu=native or similar)
if let Some(token) = Desktop64::summon() {
    // Token exists = CPU definitely has AVX2 + FMA + BMI2
}
```

**2. The `#[arcane]` Macro** - Transforms your function to enable `#[target_feature]`:

```rust
#[arcane]
fn my_kernel(token: impl HasAvx2, data: &[f32; 8]) -> [f32; 8] {
    // Intrinsics are safe here!
    let v = _mm256_setzero_ps();
    // ...
}
```

The macro generates:

```rust
fn my_kernel(token: impl HasAvx2, data: &[f32; 8]) -> [f32; 8] {
    #[target_feature(enable = "avx2")]
    unsafe fn inner(data: &[f32; 8]) -> [f32; 8] {
        let v = _mm256_setzero_ps();  // Safe inside #[target_feature]!
        // ...
    }
    // SAFETY: The token parameter proves the caller verified CPU support
    unsafe { inner(data) }
}
```

**Why is this safe?**

1. `inner()` has `#[target_feature(enable = "avx2")]`, so Rust allows intrinsics without `unsafe`
2. Calling `inner()` is unsafe, but we know it's valid because:
   - The function requires a token parameter
   - Tokens can only be created via `summon()` which checks CPU features
   - Therefore, if you have a token, the CPU supports the features

### Generic Token Bounds

Functions accept any token that provides the required capabilities:

```rust
use archmage::{HasAvx2, HasFma, arcane};
use archmage::mem::avx;
use std::arch::x86_64::*;

// Accept any token with AVX2 (Avx2Token, Desktop64, Server64, etc.)
#[arcane]
fn double(token: impl HasAvx2, data: &[f32; 8]) -> [f32; 8] {
    let v = avx::_mm256_loadu_ps(token, data);
    let doubled = _mm256_add_ps(v, v);
    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, doubled);
    out
}

// Require multiple features with inline bounds
#[arcane]
fn fma_kernel<T: HasAvx2 + HasFma>(token: T, a: &[f32; 8], b: &[f32; 8], c: &[f32; 8]) -> [f32; 8] {
    let va = avx::_mm256_loadu_ps(token, a);
    let vb = avx::_mm256_loadu_ps(token, b);
    let vc = avx::_mm256_loadu_ps(token, c);
    let result = _mm256_fmadd_ps(va, vb, vc);  // a * b + c
    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, result);
    out
}

// Where clause syntax
#[arcane]
fn square<T>(token: T, data: &mut [f32; 8])
where
    T: HasAvx2
{
    let v = avx::_mm256_loadu_ps(token, data);
    let squared = _mm256_mul_ps(v, v);
    avx::_mm256_storeu_ps(token, data, squared);
}
```

The trait hierarchy means broader tokens satisfy narrower bounds:
- `Desktop64` implements `HasAvx2`, `HasFma`, `HasSse42`, etc.
- `Server64` implements everything `Desktop64` does, plus `HasAvx512f`, etc.

## Choosing a Token

**Start with `Desktop64`** - it's the sweet spot for modern x86-64:

| Token | Features | Hardware Coverage |
|-------|----------|-------------------|
| `Desktop64` | AVX2 + FMA + BMI2 | Intel Haswell 2013+, AMD Zen 1 2017+ (~95% of x86-64) |
| `Server64` | + AVX-512 | Intel Skylake-X 2017+, AMD Zen 4 2022+ |
| `X64V2Token` | SSE4.2 + POPCNT | Intel Nehalem 2008+, AMD Bulldozer 2011+ |

**For specific features:**

| Token | Use Case |
|-------|----------|
| `Avx2Token` | Need AVX2 but not FMA |
| `Avx2FmaToken` | AVX2 + FMA (most floating-point SIMD) |
| `FmaToken` | FMA only |
| `Sse2Token` | Baseline x86-64 (always available) |

**ARM tokens:**

| Token | Features | Hardware |
|-------|----------|----------|
| `NeonToken` | NEON | All AArch64 (baseline, including Apple M-series) |
| `SveToken` | SVE | Graviton 3, A64FX |
| `Sve2Token` | SVE2 | ARMv9: Graviton 4, Cortex-X2+ |

## Safe Memory Operations

With the `safe_unaligned_simd` feature, load/store uses references instead of raw pointers:

```rust
use archmage::{Desktop64, SimdToken};
use archmage::mem::avx;

if let Some(token) = Desktop64::summon() {
    let data = [1.0f32; 8];
    let v = avx::_mm256_loadu_ps(token, &data);  // Safe! Reference, not pointer

    let mut out = [0.0f32; 8];
    avx::_mm256_storeu_ps(token, &mut out, v);   // Safe!
}
```

The `mem` module wrappers accept `impl HasAvx`, `impl HasSse2`, etc., so any compatible token works.

## When to Use archmage

archmage is for when you need **specific instructions** that autovectorization won't produce:

- Complex shuffles and permutes
- Exact FMA sequences for numerical precision
- DCT butterflies and signal processing
- Gather/scatter operations
- Bit manipulation (BMI1/BMI2)

For portable SIMD without manual intrinsics, use the `wide` crate instead.

| Approach | When to Use |
|----------|-------------|
| **wide** | Portable code, let the compiler choose instructions |
| **archmage** | Need specific instructions, complex algorithms |

## Feature Flags

```toml
[dependencies]
archmage = "0.1"
```

| Feature | Description |
|---------|-------------|
| `std` (default) | Enable std library support |
| `macros` (default) | Enable `#[arcane]` macro (alias: `#[simd_fn]`) |
| `safe_unaligned_simd` (default) | Safe load/store via references (exposed as `mem` module) |

**Unstable features** (API may change):

| Feature | Description |
|---------|-------------|
| `__composite` | Higher-level ops (transpose, dot product) |
| `__wide` | Integration with the `wide` crate |

## Limitations

**Self receivers not supported in `#[arcane]`:**

```rust
// This won't work - inner functions can't have `self`
#[arcane]
fn process(&self, token: impl HasAvx2) { ... }

// Instead, take self as a regular parameter or use free functions
#[arcane]
fn process(state: &MyStruct, token: impl HasAvx2) { ... }
```

## License

MIT OR Apache-2.0

## AI-Generated Code Notice

Developed with Claude (Anthropic). Not all code manually reviewed. Review critical paths before production use.