axonml-core 0.6.2

Core abstractions for the AxonML machine learning framework

axonml-core

Overview

axonml-core is the foundational layer of the AxonML machine learning framework. It provides the Device abstraction, the Scalar/Numeric/Float trait hierarchy, reference-counted Storage<T> with pooled GPU allocations, and five compute backends (CPU, CUDA, Vulkan, Metal, WebGPU) that underpin every tensor operation in the framework.

Features

  • Device Abstraction - Device enum (Cpu, Cuda, Vulkan, Metal, Wgpu) with per-variant device index, runtime availability checks, and best_available_backend() selector (CUDA > Metal > Vulkan > WebGPU > CPU).

  • Type-Safe Data Types - DType runtime enum covering F16, F32, F64, I8, I16, I32, I64, U8, U32, U64, Bool with size_of / is_float / is_signed / is_integer queries. Compile-time Scalar / Numeric / Float trait hierarchy for zero-cost generic dispatch.

  • Reference-Counted Storage - Storage<T> wraps either a host Vec<T> or a PooledCudaSlice behind Arc<RwLock<...>>. Supports zero-copy views via offset+len slicing, to_device() for CPU<->GPU transfer, deep copy, and RAII as_slice() / as_slice_mut() guards.

  • Five Compute Backends - CPU (rayon-parallel, matrixmultiply GEMM/GEMV, always available), CUDA (cuBLAS + 15+ custom PTX kernel modules), Vulkan (ash + gpu-allocator, SPIR-V compute), Metal (Apple Silicon, compute pipelines), WebGPU (wgpu for browser/cross-platform).

  • GPU Memory Pool - cuda_pool returns freed CUDA allocations to a size-bucketed free list instead of calling cudaFree, amortising allocator cost across training steps.

  • Device Capabilities - DeviceCapabilities exposes name, total/available memory, f16/f64 support, max threads per block, and CUDA compute capability.

  • Allocator Trait - Allocator extension point with a DefaultAllocator that performs 64-byte-aligned host allocations and reports system memory via sysinfo.
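
The size-bucketed free-list idea behind the GPU memory pool can be sketched in plain std Rust. This is an illustrative stand-in (plain byte buffers instead of real CUDA allocations; `BucketPool` is not the crate's actual type), not the cuda_pool implementation:

```rust
use std::collections::HashMap;

// Illustrative sketch of a size-bucketed free list: freed buffers are
// parked per bucket instead of being released (the analogue of skipping
// cudaFree), so later requests of a similar size can reuse them.
struct BucketPool {
    free: HashMap<usize, Vec<Vec<u8>>>,
}

impl BucketPool {
    fn new() -> Self {
        Self { free: HashMap::new() }
    }

    // Round requests up to the next power of two so buffers are reusable
    // across slightly different sizes.
    fn bucket(size: usize) -> usize {
        size.next_power_of_two()
    }

    // Reuse a parked buffer when one is available; otherwise allocate fresh.
    fn alloc(&mut self, size: usize) -> Vec<u8> {
        let b = Self::bucket(size);
        match self.free.get_mut(&b).and_then(|v| v.pop()) {
            Some(buf) => buf,
            None => vec![0u8; b],
        }
    }

    // Return a buffer to its bucket rather than dropping it.
    fn dealloc(&mut self, buf: Vec<u8>) {
        let b = Self::bucket(buf.len());
        self.free.entry(b).or_default().push(buf);
    }
}
```

Bucketing by rounded-up size is what amortises allocator cost: after a warm-up step, steady-state training allocations are served from the free list.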

Modules

Module Description
device Device enum (Cpu, Cuda, Vulkan, Metal, Wgpu) + DeviceCapabilities with availability and capability queries
dtype DType runtime enum and Scalar / Numeric / Float trait hierarchy; F16Wrapper and BoolWrapper adapters
storage Reference-counted Storage<T> with zero-copy views, device transfer, and pooled GPU slices
allocator Allocator trait and DefaultAllocator (64-byte-aligned CPU allocator)
backends Backend trait, BackendType, GpuMemory, GpuStream, plus CPU/CUDA/Vulkan/Metal/WGPU implementations
error Error / Result types for shape mismatches, device errors, and allocation failures
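
The 64-byte-aligned host allocation performed by DefaultAllocator can be sketched with std::alloc. Function names here are hypothetical, not the crate's API:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

// Illustrative 64-byte-aligned, zeroed host allocation.
fn alloc_aligned_64(bytes: usize) -> (*mut u8, Layout) {
    let layout = Layout::from_size_align(bytes, 64).expect("invalid layout");
    // SAFETY: the layout has non-zero size and a valid 64-byte alignment.
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    (ptr, layout)
}

fn free_aligned(ptr: *mut u8, layout: Layout) {
    // SAFETY: ptr was returned by alloc_zeroed with this exact layout.
    unsafe { dealloc(ptr, layout) };
}
```

64-byte alignment matches a typical cache line and the widest SIMD loads, which is why allocators in numeric libraries commonly pick it.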

Backends (under backends/)

Backend File Status
CPU cpu.rs Always compiled; rayon-parallel ops, matrixmultiply GEMM
CUDA cuda.rs + cuda_kernels/ + cuda_pool.rs Feature cuda; cuBLAS + PTX kernels for elementwise, activations, attention, Q4_K/Q6_K dequant-in-shader matmul, softmax, layernorm, RMSNorm, transpose, embedding gather
cuDNN cudnn_ops.rs Feature cudnn; conv2d forward/backward via cuDNN
Vulkan vulkan.rs Feature vulkan; ash + gpu-allocator, full buffer/pipeline/dispatch (~982 lines)
Metal metal.rs Feature metal; full buffer/pipeline/dispatch on Apple Silicon (~769 lines)
WebGPU wgpu_backend.rs Feature wgpu; full buffer/pipeline/dispatch via wgpu (~1710 lines)

Cargo Features

Feature Pulls In Purpose
std (default) Standard library support
cuda cudarc NVIDIA CUDA backend
cudnn cuda + cudarc (cudnn feature) cuDNN conv ops
vulkan ash, gpu-allocator Vulkan compute backend
metal metal, objc (macOS only) Apple Metal backend
wgpu wgpu, pollster WebGPU / cross-platform backend

Usage

Add this to your Cargo.toml:

[dependencies]
axonml-core = "0.6.2"

Basic Example

use axonml_core::{Device, DType, Storage};

// Check device availability
let device = Device::Cpu;
assert!(device.is_available());

// Create storage on CPU
let storage = Storage::<f32>::zeros(1024, device);
assert_eq!(storage.len(), 1024);

// Create storage from data
let data = vec![1.0f32, 2.0, 3.0, 4.0];
let storage = Storage::from_vec(data, Device::Cpu);

// Create a view (zero-copy slice)
let view = storage.slice(1, 2).unwrap();
assert_eq!(view.len(), 2);
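
The zero-copy view mechanics can be sketched in plain std Rust as a shared buffer plus an offset and length. This mirrors the Arc<RwLock<...>> design described above, not the crate's exact Storage<T> type:

```rust
use std::sync::{Arc, RwLock};

// Minimal stand-in for reference-counted storage with offset+len views.
#[derive(Clone)]
struct MiniStorage<T> {
    data: Arc<RwLock<Vec<T>>>,
    offset: usize,
    len: usize,
}

impl<T: Clone> MiniStorage<T> {
    fn from_vec(v: Vec<T>) -> Self {
        let len = v.len();
        Self { data: Arc::new(RwLock::new(v)), offset: 0, len }
    }

    fn len(&self) -> usize {
        self.len
    }

    // Zero-copy view: clones the Arc, never the buffer.
    fn slice(&self, offset: usize, len: usize) -> Option<Self> {
        if offset + len > self.len {
            return None;
        }
        Some(Self { data: Arc::clone(&self.data), offset: self.offset + offset, len })
    }

    // Copy the viewed range out; the read lock is held only for the copy.
    fn to_vec(&self) -> Vec<T> {
        let guard = self.data.read().unwrap();
        guard[self.offset..self.offset + self.len].to_vec()
    }
}
```

A view is just another handle onto the same buffer, so dropping either the parent or the view leaves the other valid; the buffer is freed when the last handle goes away.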

Device Capabilities

use axonml_core::Device;

let device = Device::Cpu;
let caps = device.capabilities();

println!("Device: {}", caps.name);
println!("Total Memory: {} bytes", caps.total_memory);
println!("Supports f16: {}", caps.supports_f16);
println!("Supports f64: {}", caps.supports_f64);

Data Types

use axonml_core::{DType, Scalar, Numeric, Float};

// Query dtype properties
assert!(DType::F32.is_float());
assert_eq!(DType::F32.size_of(), 4);

// Use type traits
fn process<T: Float>(data: &[T]) -> T {
    data.iter().fold(T::ZERO, |acc, &x| acc + x)
}

Picking a Backend

use axonml_core::backends::{best_available_backend, gpu_count, BackendType};

let backend = best_available_backend();
match backend {
    BackendType::Cpu => println!("Falling back to CPU"),
    _ => println!("Using {} GPU(s)", gpu_count()),
}
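
The CUDA > Metal > Vulkan > WebGPU > CPU priority amounts to a first-available scan with CPU as the fallback. A self-contained sketch, with a hypothetical availability predicate in place of the crate's real runtime probing:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BackendKind {
    Cuda,
    Metal,
    Vulkan,
    Wgpu,
    Cpu,
}

// Return the first available backend in priority order; CPU always works.
fn pick_backend(available: impl Fn(BackendKind) -> bool) -> BackendKind {
    [BackendKind::Cuda, BackendKind::Metal, BackendKind::Vulkan, BackendKind::Wgpu]
        .into_iter()
        .find(|&b| available(b))
        .unwrap_or(BackendKind::Cpu)
}
```

Keeping the priority order in a single array makes the policy easy to audit and to reorder if, say, a workload prefers Vulkan over Metal.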

Tests

Run the test suite:

cargo test -p axonml-core

License

Licensed under either of:

at your option.


Last updated: 2026-04-16 (v0.6.2)