#[autoversion]
Let the compiler auto-vectorize scalar code for each architecture.
Write a plain scalar function with a SimdToken placeholder parameter.
#[autoversion] generates architecture-specific copies — each compiled
with different #[target_feature] flags via #[arcane] — plus a runtime
dispatcher that calls the best one the CPU supports.
You don’t touch intrinsics, don’t import SIMD types, don’t think about
lane widths. The compiler’s auto-vectorizer does the work; you give it
permission via #[target_feature], which #[autoversion] handles.
§The simple win
use archmage::SimdToken;
#[autoversion]
fn sum_of_squares(_token: SimdToken, data: &[f32]) -> f32 {
let mut sum = 0.0f32;
for &x in data {
sum += x * x;
}
sum
}
// Call directly — no token, no unsafe:
let result = sum_of_squares(&my_data);

The _token parameter is never used in the body. It exists so the macro
knows where to substitute concrete token types. Each generated variant
gets #[arcane] → #[target_feature(enable = "avx2,fma,...")], which
unlocks the compiler’s auto-vectorizer for that feature set.
On x86-64 with the _v3 variant (AVX2+FMA), that loop compiles to
vfmadd231ps — fused multiply-add on 8 floats per instruction. On aarch64
with NEON, you get fmla. The _scalar fallback compiles without any
SIMD target features, as a safety net for unknown hardware.
§Chunks + remainder
The classic data-processing pattern works naturally:
#[autoversion]
fn normalize(_token: SimdToken, data: &mut [f32], scale: f32) {
// Compiler auto-vectorizes this — no manual SIMD needed.
// On v3, this becomes vdivps + vmulps on 8 floats at a time.
for x in data.iter_mut() {
*x = (*x - 128.0) * scale;
}
}

If you want explicit control over chunk boundaries (e.g., for accumulator patterns), that works too:
#[autoversion]
fn dot_product(_token: SimdToken, a: &[f32], b: &[f32]) -> f32 {
let n = a.len().min(b.len());
let mut sum = 0.0f32;
for i in 0..n {
sum += a[i] * b[i];
}
sum
}

The compiler decides the vector width based on the target features of each variant (8 floats for AVX2, 4 for NEON, 1 for scalar).
§What gets generated
With default tiers, #[autoversion] fn process(_t: SimdToken, data: &[f32]) -> f32
expands to:
- process_v4(token: X64V4Token, ...) — AVX-512 (behind #[cfg(feature = "avx512")])
- process_v3(token: X64V3Token, ...) — AVX2+FMA
- process_neon(token: NeonToken, ...) — aarch64 NEON
- process_wasm128(token: Wasm128Token, ...) — WASM SIMD
- process_scalar(token: ScalarToken, ...) — no SIMD, always available
- process(data: &[f32]) -> f32 — dispatcher (SimdToken param removed)
Each non-scalar variant is wrapped in #[arcane] (for #[target_feature])
and #[cfg(target_arch = ...)]. The dispatcher does runtime CPU feature
detection via Token::summon() and calls the best match. When compiled
with -C target-cpu=native, the detection is elided by the compiler.
The suffixed variants are private sibling functions — only the dispatcher is public. Within the same module, you can call them directly for testing or benchmarking.
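Because the suffixed variants are callable within the defining module, a unit test can check each reachable variant against the scalar baseline. A minimal sketch, assuming the process example above (the tolerance accounts for FMA changing float rounding between tiers):

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn variants_agree_with_scalar() {
        let data: Vec<f32> = (0..1000).map(|i| i as f32 * 0.5).collect();

        // The scalar variant is always available and serves as ground truth.
        let baseline = process_scalar(archmage::ScalarToken, &data);

        // The dispatcher picks the best tier at runtime; results should
        // match within floating-point tolerance.
        let dispatched = process(&data);
        assert!((baseline - dispatched).abs() < 1e-3);

        // A specific tier can be checked only when the CPU supports it.
        #[cfg(target_arch = "x86_64")]
        if let Some(t) = archmage::X64V3Token::summon() {
            let v3 = process_v3(t, &data);
            assert!((baseline - v3).abs() < 1e-3);
        }
    }
}
```

Exact equality between tiers is not guaranteed for float reductions, since FMA and reassociation change rounding; compare with a tolerance as above.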
§SimdToken replacement
#[autoversion] replaces the SimdToken type annotation in the function
signature with the concrete token type for each variant (e.g.,
archmage::X64V3Token). Only the parameter’s type changes — the function
body is never reparsed, which keeps compile times low.
The token variable (whatever you named it — token, _token, _t)
keeps working in the body because its type comes from the signature.
So f32x8::from_array(token, ...) works — token is now an X64V3Token
which satisfies the same trait bounds as SimdToken.
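A sketch of mixing the token into the body, assuming f32x8 is in scope from the same crate (its import path, and everything past the from_array call, is illustrative only):

```rust
use archmage::SimdToken;

#[autoversion]
fn demo(token: SimdToken, data: &[f32; 8]) -> f32 {
    // In the _v3 variant, `token` has concrete type X64V3Token here,
    // yet this call compiles unchanged: the concrete token satisfies
    // the same trait bounds that f32x8::from_array expects.
    let _v = f32x8::from_array(token, *data);
    // ... explicit SIMD work on _v would go here (hypothetical) ...
    data.iter().sum()
}
```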
#[magetypes] takes a different approach: it replaces the text Token
everywhere in the function — signature and body — via string substitution.
Use #[magetypes] when you need body-level type substitution (e.g.,
Token-dependent constants or type aliases that differ per variant).
Use #[autoversion] when you want compiler auto-vectorization of scalar
code with zero boilerplate.
§Benchmarking
Measure the speedup with a side-by-side comparison. The generated
_scalar variant serves as the baseline; the dispatcher picks the
best available:
use criterion::{Criterion, black_box, criterion_group, criterion_main};
use archmage::SimdToken;
#[autoversion]
fn sum_squares(_token: SimdToken, data: &[f32]) -> f32 {
data.iter().map(|&x| x * x).fold(0.0f32, |a, b| a + b)
}
fn bench(c: &mut Criterion) {
let data: Vec<f32> = (0..4096).map(|i| i as f32 * 0.01).collect();
let mut group = c.benchmark_group("sum_squares");
// Dispatched — picks best available at runtime
group.bench_function("dispatched", |b| {
b.iter(|| sum_squares(black_box(&data)))
});
// Scalar baseline — no target_feature, no auto-vectorization
group.bench_function("scalar", |b| {
b.iter(|| sum_squares_scalar(archmage::ScalarToken, black_box(&data)))
});
// Specific tier (useful for isolating which tier wins)
#[cfg(target_arch = "x86_64")]
if let Some(t) = archmage::X64V3Token::summon() {
group.bench_function("v3_avx2_fma", |b| {
b.iter(|| sum_squares_v3(t, black_box(&data)));
});
}
group.finish();
}
criterion_group!(benches, bench);
criterion_main!(benches);

For a tight numeric loop on x86-64, the _v3 variant (AVX2+FMA)
typically runs 4-8x faster than _scalar because #[target_feature]
unlocks auto-vectorization that the baseline build can’t use.
§Explicit tiers
#[autoversion(v3, v4, v4x, neon, arm_v2, wasm128)]
fn process(_token: SimdToken, data: &[f32]) -> f32 {
// ...
}

scalar is always included implicitly.
Default tiers (when no list given): v4, v3, neon, wasm128, scalar.
Known tiers: v1, v2, v3, v3_crypto, v4, v4x, neon,
neon_aes, neon_sha3, neon_crc, arm_v2, arm_v3, wasm128,
wasm128_relaxed, x64_crypto, scalar.
§Methods with self receivers
For inherent methods, self works naturally — no _self needed:
impl ImageBuffer {
#[autoversion]
fn normalize(&mut self, token: SimdToken, gamma: f32) {
for pixel in &mut self.data {
*pixel = (*pixel / 255.0).powf(gamma);
}
}
}
// Call normally — no token:
buffer.normalize(2.2);

All receiver types work: self, &self, &mut self. Non-scalar variants
get #[arcane] (sibling mode), where self/Self resolve naturally.
§Trait methods (requires _self = Type)
Trait methods can’t use #[autoversion] directly because proc macro
attributes on trait impl items can’t expand to multiple sibling functions.
Use the delegation pattern with _self = Type:
trait Processor {
fn process(&self, data: &[f32]) -> f32;
}
impl Processor for MyType {
fn process(&self, data: &[f32]) -> f32 {
self.process_impl(data) // delegate to autoversioned method
}
}
impl MyType {
#[autoversion(_self = MyType)]
fn process_impl(&self, _token: SimdToken, data: &[f32]) -> f32 {
_self.weights.iter().zip(data).map(|(w, d)| w * d).sum()
}
}

_self = Type uses nested mode in #[arcane], which is required for
trait impls. Use _self (not self) in the body when using this form.
§Comparison with #[magetypes] + incant!
| | #[autoversion] | #[magetypes] + incant! |
|---|---|---|
| Placeholder | SimdToken | Token |
| Generates variants | Yes | Yes (magetypes) |
| Generates dispatcher | Yes | No (you write incant!) |
| Best for | Scalar auto-vectorization | Explicit SIMD with typed vectors |
| Lines of code | 1 attribute | 2+ (magetypes + incant + arcane) |
Use #[autoversion] for scalar loops you want auto-vectorized. Use
#[magetypes] + incant! when you need f32x8, u8x32, and
hand-tuned SIMD code per architecture.