hanzo-rocm-kernels 0.10.2

# hanzo-ml-rocm-kernels

ROCm/HIP kernel support for the Hanzo deep learning framework.

## Overview

This crate provides ROCm (AMD GPU) kernel support for Hanzo. Unlike CUDA which can embed PTX directly, ROCm/HIP requires ahead-of-time (AOT) compilation for specific GPU architectures.

## Architecture

### AOT Cache System

We use an **Ahead-of-Time (AOT) compilation cache** approach:

1. **Source Code**: HIP kernels are shipped as source code (`.hip` files)
2. **On-Demand Compilation**: First time a kernel is needed, it's compiled using `hipcc`
3. **Caching**: Compiled binaries are cached for reuse
4. **Future Runs**: Load cached binaries directly (no recompilation)

### Cache Location

Compiled binaries are stored at:

```
~/.cache/hanzo-ml-rocm/{arch}-{rocm_version}/
```

For example:
- `~/.cache/hanzo-ml-rocm/gfx908-6.1/binary_a1b2c3d4.cso`
- `~/.cache/hanzo-ml-rocm/gfx942-6.2/binary_a1b2c3d4.cso`

Where:
- `{arch}` = GPU architecture (gfx908, gfx90a, gfx942, etc.)
- `{rocm_version}` = ROCm version (6.0, 6.1, 6.2, etc.)
- `{hash}` = SHA256 hash of source code (first 16 chars)

### Key Components

#### CacheManager (`src/cache.rs`)

Manages the AOT compilation cache:

- **GPU Detection**: Automatically detects GPU architecture using `rocminfo` or falls back to environment variable `CANDLE_ROCM_ARCH`
- **Version Detection**: Detects ROCm version using `hipcc --version` or environment variable `CANDLE_ROCM_VERSION`
- **Compilation**: Invokes `hipcc` with appropriate flags:
  ```bash
  hipcc --offload-arch={arch} -O3 -fPIC -c -o output.o input.hip
  ```
- **Caching**: Stores compiled `.cso` (code object) files with source hash versioning

Usage:
```rust
use hanzo_rocm_kernels::CacheManager;
use rocm_rs::hip::Device;

let device = Device::new(0)?;
let cache = CacheManager::new(&device)?;
let binary = cache.get_or_compile("binary", source_code)?;
```

#### KernelManager (`src/manager.rs`)

Higher-level manager that:

- Wraps CacheManager
- Loads compiled binaries as `rocm_rs::hip::Module`
- Returns `Arc<Module>` for thread-safe sharing
- Maintains in-memory module cache

Usage:
```rust
use hanzo_rocm_kernels::KernelManager;
use hanzo_rocm_kernels::source::Source;

let device = Device::new(0)?;
let manager = KernelManager::new(&device)?;
let module = manager.get_or_compile_module(Source::Binary)?;
```

### Environment Variables

- `CANDLE_ROCM_ARCH` - Override GPU architecture detection (e.g., "gfx908")
- `CANDLE_ROCM_VERSION` - Override ROCm version detection (e.g., "6.1")

## Requirements

- ROCm/HIP installed (provides `hipcc`)
- AMD GPU with supported architecture

## Kernel Types

Currently supports:
- **Binary operations**: Add, Sub, Mul, Div, Minimum, Maximum

## Building

```bash
cd hanzo-ml-rocm-kernels
cargo build
```

Note: First build will compile dependencies. No GPU required for building, but `hipcc` must be in PATH if you want to compile kernels.

## Testing

```bash
cargo test
```

## Implementation Notes

### Why AOT instead of JIT?

The `rocm-rs` crate (v0.5) doesn't support runtime compilation. It only supports:
- `Module::load(path)` - Load from file
- `Module::load_data(bytes)` - Load from bytes

This makes JIT compilation (via hiprtc) unavailable, so we compile ahead-of-time on first run.

### Supported GPU Architectures

Common AMD GPU architectures:
- CDNA2: gfx90a (MI200 series)
- CDNA3: gfx942 (MI300 series)
- RDNA3: gfx1100, gfx1101, gfx1102 (RX 7000 series)

The system will try to auto-detect, but you can override with `CANDLE_ROCM_ARCH`.

## License

MIT OR Apache-2.0