hanzo-rocm-kernels 0.10.2

ROCm/HIP kernels for Hanzo
docs.rs failed to build hanzo-rocm-kernels-0.10.2
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.

hanzo-ml-rocm-kernels

ROCm/HIP kernel support for the Hanzo deep learning framework.

Overview

This crate provides ROCm (AMD GPU) kernel support for Hanzo. Unlike CUDA which can embed PTX directly, ROCm/HIP requires ahead-of-time (AOT) compilation for specific GPU architectures.

Architecture

AOT Cache System

We use an Ahead-of-Time (AOT) compilation cache approach:

  1. Source Code: HIP kernels are shipped as source code (.hip files)
  2. On-Demand Compilation: First time a kernel is needed, it's compiled using hipcc
  3. Caching: Compiled binaries are cached for reuse
  4. Future Runs: Load cached binaries directly (no recompilation)

Cache Location

Compiled binaries are stored at:

~/.cache/hanzo-ml-rocm/{arch}-{rocm_version}/

For example:

  • ~/.cache/hanzo-ml-rocm/gfx908-6.1/binary_a1b2c3d4.cso
  • ~/.cache/hanzo-ml-rocm/gfx942-6.2/binary_a1b2c3d4.cso

Where:

  • {arch} = GPU architecture (gfx908, gfx90a, gfx942, etc.)
  • {rocm_version} = ROCm version (6.0, 6.1, 6.2, etc.)
  • {hash} = SHA256 hash of source code (first 16 chars)

Key Components

CacheManager (src/cache.rs)

Manages the AOT compilation cache:

  • GPU Detection: Automatically detects GPU architecture using rocminfo or falls back to environment variable CANDLE_ROCM_ARCH
  • Version Detection: Detects ROCm version using hipcc --version or environment variable CANDLE_ROCM_VERSION
  • Compilation: Invokes hipcc with appropriate flags:
    hipcc --offload-arch={arch} -O3 -fPIC -c -o output.o input.hip
    
  • Caching: Stores compiled .cso (code object) files with source hash versioning

Usage:

use hanzo_rocm_kernels::CacheManager;
use rocm_rs::hip::Device;

let device = Device::new(0)?;
let cache = CacheManager::new(&device)?;
let binary = cache.get_or_compile("binary", source_code)?;

KernelManager (src/manager.rs)

Higher-level manager that:

  • Wraps CacheManager
  • Loads compiled binaries as rocm_rs::hip::Module
  • Returns Arc<Module> for thread-safe sharing
  • Maintains in-memory module cache

Usage:

use hanzo_rocm_kernels::KernelManager;
use hanzo_rocm_kernels::source::Source;

let device = Device::new(0)?;
let manager = KernelManager::new(&device)?;
let module = manager.get_or_compile_module(Source::Binary)?;

Environment Variables

  • CANDLE_ROCM_ARCH - Override GPU architecture detection (e.g., "gfx908")
  • CANDLE_ROCM_VERSION - Override ROCm version detection (e.g., "6.1")

Requirements

  • ROCm/HIP installed (provides hipcc)
  • AMD GPU with supported architecture

Kernel Types

Currently supports:

  • Binary operations: Add, Sub, Mul, Div, Minimum, Maximum

Building

cd hanzo-ml-rocm-kernels
cargo build

Note: First build will compile dependencies. No GPU required for building, but hipcc must be in PATH if you want to compile kernels.

Testing

cargo test

Implementation Notes

Why AOT instead of JIT?

The rocm-rs crate (v0.5) doesn't support runtime compilation. It only supports:

  • Module::load(path) - Load from file
  • Module::load_data(bytes) - Load from bytes

This makes JIT compilation (via hiprtc) unavailable, so we compile ahead-of-time on first run.

Supported GPU Architectures

Common AMD GPU architectures:

  • CDNA2: gfx90a (MI200 series)
  • CDNA3: gfx942 (MI300 series)
  • RDNA3: gfx1100, gfx1101, gfx1102 (RX 7000 series)

The system will try to auto-detect, but you can override with CANDLE_ROCM_ARCH.

License

MIT OR Apache-2.0