amx-sys

Low-level AMX instruction emulation - Hardware-faithful implementation

This crate exposes one function for each of the 23 AMX (Apple Matrix eXtensions) instructions. Each function emulates the instruction in software and aims to reproduce the behavior of Apple silicon faithfully.

Features

✅ All 23 AMX instructions
✅ Complete register file emulation (8 X, 8 Y, 64 Z registers)
✅ Full data type support (i8-i64, u8-u64, f32, f64)
✅ All element sizes (B8, B16, B32, B64)
✅ All shuffle modes (S0, S1, S2, S3)
✅ 100% parity with C reference implementation
✅ 100 tests (100% pass rate)
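
To make the register file sizes above concrete, here is a minimal sketch of how 8 X, 8 Y, and 64 Z registers of 64 bytes each could be modeled. The RegisterFile type and its methods are hypothetical names chosen for illustration; they are not the crate's AmxState API, and the little-endian lane layout is an assumption.

```rust
/// Each X register, Y register, and Z row holds 64 bytes.
const REG_BYTES: usize = 64;

/// Hypothetical model of the AMX register file (not the crate's own type).
struct RegisterFile {
    x: [[u8; REG_BYTES]; 8],  // 8 X registers
    y: [[u8; REG_BYTES]; 8],  // 8 Y registers
    z: [[u8; REG_BYTES]; 64], // 64 Z rows
}

impl RegisterFile {
    /// Create a register file with every byte cleared.
    fn new() -> Self {
        RegisterFile {
            x: [[0; REG_BYTES]; 8],
            y: [[0; REG_BYTES]; 8],
            z: [[0; REG_BYTES]; 64],
        }
    }

    /// Total storage in bytes: 8*64 + 8*64 + 64*64.
    fn size_bytes() -> usize {
        std::mem::size_of::<RegisterFile>()
    }

    /// View one X register as 16 f32 lanes (element size B32),
    /// assuming little-endian lane layout.
    fn x_as_f32(&self, reg: usize) -> [f32; 16] {
        let mut out = [0.0f32; 16];
        for (lane, chunk) in self.x[reg].chunks_exact(4).enumerate() {
            out[lane] = f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        }
        out
    }
}
```

Under this model the whole register file occupies 5,120 bytes, and the same 64 bytes can be reinterpreted as 64, 32, 16, or 8 lanes depending on the element size.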

Instructions

Load/Store (8)

ldx, ldy, ldz, ldzi - Load X, Y, Z registers (ldzi loads Z interleaved)
stx, sty, stz, stzi - Store X, Y, Z registers (stzi stores Z interleaved)

Extract (2)

extrx - Extract from X register with shuffle
extry - Extract from Y register with shuffle

Floating-Point (6)

fma16, fma32, fma64 - Multiply-accumulate
fms16, fms32, fms64 - Multiply-subtract
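
As a rough illustration of what the multiply-accumulate family computes in matrix mode, the sketch below forms an outer product of X and Y and accumulates it into Z, that is z[j][i] += x[i] * y[j] for FMA and z[j][i] -= x[i] * y[j] for FMS. The function names and the plain f32 arrays are assumptions made for this example; the crate's fma32 instead takes an AmxState and operands, and the use of a fused multiply-add here is also an assumption.

```rust
/// A 64-byte register holds 16 f32 lanes.
const LANES: usize = 16;

/// Hypothetical matrix-mode FMA: z[j][i] += x[i] * y[j].
fn fma32_outer(z: &mut [[f32; LANES]; LANES], x: &[f32; LANES], y: &[f32; LANES]) {
    for j in 0..LANES {
        for i in 0..LANES {
            // mul_add computes x[i] * y[j] + z[j][i] with a single rounding.
            z[j][i] = x[i].mul_add(y[j], z[j][i]);
        }
    }
}

/// Hypothetical matrix-mode FMS: z[j][i] -= x[i] * y[j].
fn fms32_outer(z: &mut [[f32; LANES]; LANES], x: &[f32; LANES], y: &[f32; LANES]) {
    for j in 0..LANES {
        for i in 0..LANES {
            // Negating x[i] turns the fused multiply-add into a multiply-subtract.
            z[j][i] = (-x[i]).mul_add(y[j], z[j][i]);
        }
    }
}
```

Applying fms32_outer after fma32_outer with the same operands returns Z to its starting values whenever the products are exactly representable, which makes a convenient sanity check.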

Integer (2)

mac16 - Signed 16-bit multiply-accumulate
mac16_unsigned - Unsigned 16-bit multiply-accumulate

Vector (2)

vecint - Vector integer operations
vecfp - Vector floating-point operations
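
Vector mode works lane by lane, z[i] += x[i] * y[i], in contrast to matrix mode, which forms an outer product. The sketch below shows that shape for 16-bit integer lanes. The vec_mac16 name is hypothetical, wrapping arithmetic is an assumption, and options the real instructions may offer (such as shifts or saturation) are not modeled.

```rust
/// A 64-byte register holds 32 lanes of 16 bits.
const LANES16: usize = 32;

/// Hypothetical vector-mode multiply-accumulate: z[i] += x[i] * y[i].
/// Overflow wraps around (an assumption for this sketch).
fn vec_mac16(z: &mut [i16; LANES16], x: &[i16; LANES16], y: &[i16; LANES16]) {
    for i in 0..LANES16 {
        z[i] = z[i].wrapping_add(x[i].wrapping_mul(y[i]));
    }
}
```

Each call touches only 32 accumulators, whereas the matrix form of the same operation would update a 32 by 32 grid.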

Matrix (2)

matint - Matrix integer operations
matfp - Matrix floating-point operations

Lookup Table (1)

genlut - Generate lookup table

Usage

use amx_sys::registers::AmxState;
use amx_sys::instructions::ldst::*;
use amx_sys::instructions::fma::fma32;

let mut state = AmxState::new();

// Load data
let data = [0u8; 64];
ldx(&mut state, 0, &data);
ldy(&mut state, 0, &data);
ldz(&mut state, 0, &data);

// Perform FMA
fma32(&mut state, 0, 0, 0);

// Store result: FMA accumulates into Z, so read Z row 0 back
// (stz is assumed here to mirror stx's signature)
let result = stz(&state, 0);

Testing

All 23 instructions are thoroughly tested with:

100 test cases (100% pass rate)
Randomized inputs (xoshiro256++ RNG)
All data types and element sizes
1,000+ iterations of stress testing
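
For readers unfamiliar with the generator named above, here is a minimal sketch of xoshiro256++ and of how it could fill a 64-byte register image with reproducible random data. The struct and method names are illustrative and are not taken from the crate's test suite; only the state update follows the published xoshiro256++ algorithm.

```rust
/// Minimal xoshiro256++ generator (256 bits of state, 64-bit output).
struct Xoshiro256PlusPlus {
    s: [u64; 4],
}

impl Xoshiro256PlusPlus {
    /// Build a generator from an explicit state; all-zero state is invalid.
    fn new(seed: [u64; 4]) -> Self {
        assert!(seed != [0; 4], "xoshiro256++ state must not be all zero");
        Self { s: seed }
    }

    /// Produce the next 64-bit value and advance the state.
    fn next_u64(&mut self) -> u64 {
        let result = self.s[0]
            .wrapping_add(self.s[3])
            .rotate_left(23)
            .wrapping_add(self.s[0]);
        let t = self.s[1] << 17;
        self.s[2] ^= self.s[0];
        self.s[3] ^= self.s[1];
        self.s[1] ^= self.s[2];
        self.s[0] ^= self.s[3];
        self.s[2] ^= t;
        self.s[3] = self.s[3].rotate_left(45);
        result
    }

    /// Fill one 64-byte buffer (the size of an X or Y register) with random bytes.
    fn fill_register(&mut self, buf: &mut [u8; 64]) {
        for chunk in buf.chunks_exact_mut(8) {
            chunk.copy_from_slice(&self.next_u64().to_le_bytes());
        }
    }
}
```

Because the sequence depends only on the seed, a failing randomized case can be replayed exactly by reusing the same seed.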

Run tests:

cargo test

Benchmarking

Run the comprehensive IO-vs-compute benchmark:

cargo bench -p amx-sys --bench io_vs_compute

The benchmark reports CSV-style rows with separate io_only_ns, compute_only_ns, and end_to_end_ns timings for:

Extract shuffle modes S0-S3
FMA and FMS precisions
MAC16 signed and unsigned
VECINT and MATINT for all signed and unsigned element sizes
VECFP and MATFP for all floating-point precisions
GENLUT for all supported element sizes
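
The three timings can be post-processed to see how much of a run is spent moving data. The sketch below assumes a hypothetical row layout of name,io_only_ns,compute_only_ns,end_to_end_ns; the actual column order and any extra columns printed by the benchmark may differ, and BenchRow, parse_row, and io_fraction are names invented for this example.

```rust
/// One benchmark row under the assumed layout:
/// name,io_only_ns,compute_only_ns,end_to_end_ns
#[derive(Debug, PartialEq)]
struct BenchRow {
    name: String,
    io_only_ns: f64,
    compute_only_ns: f64,
    end_to_end_ns: f64,
}

/// Parse one CSV-style line; returns None for headers or malformed rows.
fn parse_row(line: &str) -> Option<BenchRow> {
    let mut fields = line.split(',').map(str::trim);
    let name = fields.next()?.to_string();
    let io_only_ns = fields.next()?.parse::<f64>().ok()?;
    let compute_only_ns = fields.next()?.parse::<f64>().ok()?;
    let end_to_end_ns = fields.next()?.parse::<f64>().ok()?;
    Some(BenchRow { name, io_only_ns, compute_only_ns, end_to_end_ns })
}

/// Fraction of the end-to-end time attributable to load/store alone.
fn io_fraction(row: &BenchRow) -> f64 {
    row.io_only_ns / row.end_to_end_ns
}
```

A header line fails the numeric parse and is skipped, so the same function can be mapped over the benchmark's full output.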

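The AMX_BENCH_FILTER and AMX_BENCH_SAMPLES environment variables narrow a run, for example to the matfp cases with 3 samples: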

AMX_BENCH_FILTER=matfp AMX_BENCH_SAMPLES=3 cargo bench -p amx-sys --bench io_vs_compute

Performance

All operations run in constant time
Zero heap allocations
Pure Rust with minimal unsafe code

License

MIT OR Apache-2.0

amx-sys 0.0.1