ai-hwaccel

Universal AI hardware accelerator detection, capability querying, and workload planning for Rust.

ai-hwaccel gives you a single-call view of every AI-capable accelerator on the system -- GPUs, TPUs, NPUs, and cloud inference chips -- then helps you decide how to quantize and shard a model across them.

Why this crate?

ML frameworks typically hard-code support for one or two backends. If you need portable hardware discovery (e.g. for a model server, training launcher, or benchmarking harness) you end up writing the same sysfs/CLI-probing boilerplate everywhere. This crate consolidates that into a tested, vendor-neutral library with zero compile-time SDK dependencies.

Supported hardware

Family             Variants                         Detection method
NVIDIA CUDA        GeForce, Tesla, A100, H100, ...  nvidia-smi on $PATH
AMD ROCm           MI250, MI300, RX 7900            /sys/class/drm sysfs
Apple Metal        M1--M4 GPU cores                 /proc/device-tree/compatible
Apple ANE          Neural Engine                    /proc/device-tree/compatible
Intel NPU          Meteor Lake+                     /sys/class/misc/intel_npu
AMD XDNA           Ryzen AI NPU                     /sys/class/accel/*/device/driver
Google TPU         v4, v5e, v5p                     /dev/accel* + sysfs version
Intel Gaudi        Gaudi 2, Gaudi 3 (Habana HPU)    hl-smi on $PATH
AWS Inferentia     inf1, inf2                       /dev/neuron* or neuron-ls
AWS Trainium       trn1                             /dev/neuron* + sysfs
Qualcomm Cloud AI  AI 100                           /dev/qaic_* or /sys/class/qaic
Vulkan Compute     Any Vulkan 1.1+ device           vulkaninfo on $PATH
CPU                Always present                   /proc/meminfo (16 GiB fallback)

Quick start

Add to your Cargo.toml:

[dependencies]
ai-hwaccel = "0.19"

Library usage

use ai_hwaccel::{AcceleratorRegistry, QuantizationLevel};

fn main() {
    // 1. Discover hardware
    let registry = AcceleratorRegistry::detect();
    println!("Best device: {}", registry.best_available().unwrap());

    // 2. Pick quantization for a 7B-param model
    let quant = registry.suggest_quantization(7_000_000_000);
    println!("Recommended: {quant}");

    // 3. Plan sharding for a 70B model at BF16
    let plan = registry.plan_sharding(70_000_000_000, &QuantizationLevel::BFloat16);
    println!(
        "Strategy: {}, est. {:.0} tok/s",
        plan.strategy,
        plan.estimated_tokens_per_sec.unwrap_or(0.0)
    );
}

CLI usage

ai-hwaccel                  # Full registry JSON to stdout
ai-hwaccel --summary        # Compact summary JSON
ai-hwaccel --version        # Print version

# Logging (logs go to stderr, data to stdout)
RUST_LOG=debug ai-hwaccel   # Verbose detection diagnostics
ai-hwaccel --json-log       # Structured JSON logs to stderr

Architecture

The crate is organized into focused modules, each with a single responsibility:

src/
├── lib.rs                  # Crate root, re-exports
├── main.rs                 # CLI binary
├── hardware/               # Device type definitions
│   ├── mod.rs              #   AcceleratorType, AcceleratorFamily
│   ├── tpu.rs              #   TpuVersion (v4/v5e/v5p)
│   ├── gaudi.rs            #   GaudiGeneration (Gaudi2/Gaudi3)
│   └── neuron.rs           #   NeuronChipType (Inferentia/Trainium)
├── profile.rs              # AcceleratorProfile (capabilities per device)
├── quantization.rs         # QuantizationLevel (FP32 → INT4)
├── requirement.rs          # AcceleratorRequirement (scheduling constraints)
├── sharding.rs             # ShardingStrategy, ModelShard, ShardingPlan
├── training.rs             # TrainingMethod, MemoryEstimate
├── registry.rs             # AcceleratorRegistry (query + suggest APIs)
├── detect/                 # Hardware detection (one file per backend)
│   ├── mod.rs              #   Orchestrator + shared helpers
│   ├── cuda.rs             #   NVIDIA via nvidia-smi
│   ├── rocm.rs             #   AMD via sysfs /sys/class/drm
│   ├── apple.rs            #   Metal + ANE via device-tree
│   ├── vulkan.rs           #   Vulkan via vulkaninfo
│   ├── intel_npu.rs        #   Intel NPU via sysfs
│   ├── amd_xdna.rs         #   AMD XDNA via sysfs
│   ├── tpu.rs              #   Google TPU via /dev/accel*
│   ├── gaudi.rs            #   Intel Gaudi via hl-smi
│   ├── neuron.rs           #   AWS Neuron via neuron-ls
│   ├── intel_oneapi.rs     #   Intel oneAPI via xpu-smi
│   └── qualcomm.rs         #   Qualcomm AI 100 via sysfs
├── plan.rs                 # Sharding planner (impl on AcceleratorRegistry)
└── tests/                  # Test suite (one file per concern)
    ├── classification.rs
    ├── display.rs
    ├── quantization.rs
    ├── requirement.rs
    ├── registry.rs
    ├── sharding.rs
    ├── training.rs
    └── serde.rs

Core concepts

AcceleratorRegistry

The main entry point. detect() probes the system and returns a registry of every discovered accelerator. From there you can:

  • Query -- available(), best_available(), by_family(), satisfying()
  • Plan -- suggest_quantization(), plan_sharding()
  • Inspect -- total_memory(), total_accelerator_memory(), has_accelerator()

AcceleratorType

A 13-variant enum representing each supported device family. Provides classification helpers (is_gpu(), is_npu(), is_tpu(), is_ai_asic()) and throughput/training multipliers used by the planner.
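The classification helpers are simple predicates over the enum variants. As a minimal sketch of the idea (the variant names below are illustrative, not the crate's real 13-variant enum):

```rust
// Illustrative sketch of variant classification; the real AcceleratorType
// has 13 variants plus throughput/training multipliers for the planner.
#[allow(dead_code)]
enum Accel {
    CudaGpu,
    RocmGpu,
    MetalGpu,
    AppleAne,
    IntelNpu,
    GoogleTpu,
    Cpu,
}

impl Accel {
    fn is_gpu(&self) -> bool {
        matches!(self, Accel::CudaGpu | Accel::RocmGpu | Accel::MetalGpu)
    }
    fn is_npu(&self) -> bool {
        matches!(self, Accel::AppleAne | Accel::IntelNpu)
    }
    fn is_tpu(&self) -> bool {
        matches!(self, Accel::GoogleTpu)
    }
}

fn main() {
    println!("CUDA is a GPU: {}", Accel::CudaGpu.is_gpu());
    println!("TPU is a GPU:  {}", Accel::GoogleTpu.is_gpu());
}
```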

QuantizationLevel

Five levels from full-precision FP32 down to Int4, each carrying its bits-per-parameter and memory-reduction factor.
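The memory impact of each level follows directly from its bits per parameter. A back-of-envelope sketch of that arithmetic (names and level set here are illustrative, not the crate's API):

```rust
// Back-of-envelope model-weight memory; illustrative only, not the crate's API.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Quant { Fp32, Fp16, Bf16, Int8, Int4 }

impl Quant {
    fn bits_per_param(self) -> f64 {
        match self {
            Quant::Fp32 => 32.0,
            Quant::Fp16 | Quant::Bf16 => 16.0,
            Quant::Int8 => 8.0,
            Quant::Int4 => 4.0,
        }
    }

    /// Approximate weight memory in GiB for `params` parameters.
    fn model_gib(self, params: u64) -> f64 {
        params as f64 * self.bits_per_param() / 8.0 / (1024.0 * 1024.0 * 1024.0)
    }
}

fn main() {
    // A 7B-parameter model shrinks from roughly 26 GiB at FP32
    // to roughly 3.3 GiB at INT4.
    for q in [Quant::Fp32, Quant::Bf16, Quant::Int8, Quant::Int4] {
        println!("{}-bit: {:.1} GiB for 7B params",
                 q.bits_per_param(), q.model_gib(7_000_000_000));
    }
}
```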

ShardingPlan

Describes how a model should be distributed across devices:

Strategy          When used
None              Model fits on a single device
TensorParallel    TPU pod slices (ICI mesh)
PipelineParallel  Multiple GPUs or AI ASICs
DataParallel      Replicas for throughput
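The choice largely reduces to whether the model fits in one device's memory. A hedged sketch of that decision (a simplification with hypothetical names, not the crate's actual planner):

```rust
// Simplified strategy selection; the real planner also weighs device
// counts, interconnects, and throughput goals.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Strategy { None, TensorParallel, PipelineParallel, DataParallel }

/// Pick a strategy from model size and per-device memory (both in bytes).
/// `is_tpu_pod` stands in for the ICI-mesh check in the table above.
fn pick_strategy(model_bytes: u64, device_bytes: u64, is_tpu_pod: bool) -> Strategy {
    if model_bytes <= device_bytes {
        // Fits on one device; DataParallel would only be chosen when the
        // caller explicitly wants replicas for throughput.
        Strategy::None
    } else if is_tpu_pod {
        Strategy::TensorParallel
    } else {
        Strategy::PipelineParallel
    }
}

fn main() {
    // 70B params at BF16 ~= 140 GB, which overflows an 80 GB device.
    let plan = pick_strategy(140_000_000_000, 80_000_000_000, false);
    println!("70B @ BF16 on 80 GB GPUs: {plan:?}");
}
```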

Training memory estimation

estimate_training_memory() returns a per-component breakdown (model, optimizer, activations) for methods including full fine-tune, LoRA, QLoRA, DPO, RLHF, and distillation.
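For full fine-tuning, the classic Adam rule of thumb gives the shape of such a breakdown. A sketch with assumed coefficients (the crate's estimate_training_memory() may weight components differently, and batch-dependent activation memory is omitted here):

```rust
// Rule-of-thumb full fine-tune memory in GiB; illustrative coefficients only.
struct MemoryBreakdown { model: f64, optimizer: f64, gradients: f64 }

const GIB: f64 = 1024.0 * 1024.0 * 1024.0;

/// Full fine-tune with 16-bit weights and Adam: the optimizer keeps two
/// FP32 moments plus an FP32 master copy (~12 bytes/param); gradients
/// mirror the weights (~2 bytes/param).
fn full_finetune(params: u64) -> MemoryBreakdown {
    let p = params as f64;
    MemoryBreakdown {
        model: p * 2.0 / GIB,
        optimizer: p * 12.0 / GIB,
        gradients: p * 2.0 / GIB,
    }
}

fn main() {
    let m = full_finetune(7_000_000_000);
    println!(
        "7B full fine-tune: model {:.1} GiB, optimizer {:.1} GiB, gradients {:.1} GiB",
        m.model, m.optimizer, m.gradients
    );
}
```

Numbers like these are why LoRA and QLoRA, which freeze the base weights and train small adapters, fit on far smaller devices.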

How detection works

All detection is best-effort and non-destructive:

  1. sysfs probing -- reads /sys/class/drm, /sys/class/misc, etc.
  2. /dev introspection -- checks for device nodes like /dev/accel*, /dev/neuron*, /dev/qaic_*.
  3. $PATH tool execution -- runs nvidia-smi, hl-smi, vulkaninfo, neuron-ls when present, parses their output.

If a tool or sysfs path is absent the accelerator simply isn't registered -- no errors, no panics.
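All three probe styles reduce to cheap, fallible reads where any failure simply yields "not found". A minimal std-only sketch of that pattern (paths and arguments are illustrative, not the crate's internals):

```rust
use std::path::Path;
use std::process::Command;

/// Best-effort sysfs read: any I/O error becomes None, never a panic.
fn probe_sysfs(path: &str) -> Option<String> {
    std::fs::read_to_string(path).ok().map(|s| s.trim().to_string())
}

/// Device-node check, e.g. /dev/accel0 for a TPU.
fn device_node_present(node: &str) -> bool {
    Path::new(node).exists()
}

/// Run a $PATH tool and capture stdout; a missing binary is just None.
fn probe_tool(tool: &str, args: &[&str]) -> Option<String> {
    let out = Command::new(tool).args(args).output().ok()?;
    out.status
        .success()
        .then(|| String::from_utf8_lossy(&out.stdout).into_owned())
}

fn main() {
    println!("meminfo bytes: {:?}", probe_sysfs("/proc/meminfo").map(|s| s.len()));
    println!("accel node:    {}", device_node_present("/dev/accel0"));
    println!("nvidia-smi:    {}", probe_tool("nvidia-smi", &["--list-gpus"]).is_some());
}
```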

Logging

The library uses tracing for all diagnostic output. No logs are emitted unless the consuming application installs a tracing subscriber. The CLI binary ships with tracing-subscriber and respects the standard RUST_LOG environment variable:

RUST_LOG=ai_hwaccel=debug   # Library-level debug messages
RUST_LOG=trace              # Everything including dependency traces

Minimum supported Rust version (MSRV)

Rust 1.89 (edition 2024). Tracked in rust-toolchain.toml.

Versioning

This crate uses semantic versioning. The 0.x series is pre-1.0 and may contain breaking changes between minor versions. The version is read from the VERSION file at the repo root and kept in sync with Cargo.toml via scripts/version-bump.sh.

Documentation

Document                Description
Production guide        Deployment, caching, security, containers, C FFI
Testing guide           Running tests, hardware setup, CI matrix
Framework integration   candle, burn, tch-rs, ort examples
Threat model            Security boundaries and mitigations
JSON schema             Serialized registry format specification
Architecture decisions  ADRs for key design choices
Roadmap                 Remaining work and future plans
Contributing            How to contribute
Changelog               Release history

Development

make check   # fmt + clippy + test (same as CI)
make audit   # cargo-audit (vulnerabilities)
make deny    # cargo-deny (licenses, advisories)
make vet     # cargo-vet (dependency audits)
make doc     # generate rustdoc
make build   # release build

License

Licensed under the GNU Affero General Public License v3.0.

See LICENSE for the full text.