ringkernel-cuda-codegen 0.1.0

CUDA code generation from Rust DSL for RingKernel stencil kernels
Documentation
# ringkernel-cuda-codegen

Rust-to-CUDA transpiler for RingKernel GPU kernels.

## Overview

This crate enables writing GPU kernels in a restricted Rust DSL and transpiling them to CUDA C code. It supports three kernel types:

1. **Global Kernels** - Standard CUDA `__global__` functions
2. **Stencil Kernels** - Tile-based kernels with `GridPos` abstraction
3. **Ring Kernels** - Persistent actor kernels with message loops

## Installation

```toml
[dependencies]
ringkernel-cuda-codegen = "0.1"
syn = { version = "2.0", features = ["full"] }
```

## Global Kernels

For general-purpose CUDA kernels:

```rust
use ringkernel_cuda_codegen::transpile_global_kernel;
use syn::parse_quote;

let func: syn::ItemFn = parse_quote! {
    fn saxpy(x: &[f32], y: &mut [f32], a: f32, n: i32) {
        let idx = block_idx_x() * block_dim_x() + thread_idx_x();
        if idx >= n { return; }
        y[idx as usize] = a * x[idx as usize] + y[idx as usize];
    }
};

let cuda_code = transpile_global_kernel(&func)?;
```

## Stencil Kernels

For grid-based computations with neighbor access:

```rust
use ringkernel_cuda_codegen::{transpile_stencil_kernel, StencilConfig};

let func: syn::ItemFn = parse_quote! {
    fn fdtd(p: &[f32], p_prev: &mut [f32], c2: f32, pos: GridPos) {
        let lap = pos.north(p) + pos.south(p) + pos.east(p) + pos.west(p)
                  - 4.0 * p[pos.idx()];
        p_prev[pos.idx()] = 2.0 * p[pos.idx()] - p_prev[pos.idx()] + c2 * lap;
    }
};

let config = StencilConfig::new("fdtd")
    .with_tile_size(16, 16)
    .with_halo(1);

let cuda_code = transpile_stencil_kernel(&func, &config)?;
```

## Ring Kernels

For persistent actor-model kernels:

```rust
use ringkernel_cuda_codegen::{transpile_ring_kernel, RingKernelConfig};

let handler: syn::ItemFn = parse_quote! {
    fn process(ctx: &RingContext, msg: &Request) -> Response {
        let tid = ctx.global_thread_id();
        ctx.sync_threads();
        Response { value: msg.value * 2.0, id: tid as u64 }
    }
};

let config = RingKernelConfig::new("processor")
    .with_block_size(128)
    .with_queue_capacity(1024)
    .with_hlc(true)   // Hybrid Logical Clocks
    .with_k2k(true);  // Kernel-to-kernel messaging

let cuda_code = transpile_ring_kernel(&handler, &config)?;
```

## DSL Reference

### Thread/Block Indices
- `thread_idx_x()`, `thread_idx_y()`, `thread_idx_z()`
- `block_idx_x()`, `block_idx_y()`, `block_idx_z()`
- `block_dim_x()`, `block_dim_y()`, `block_dim_z()`
- `grid_dim_x()`, `grid_dim_y()`, `grid_dim_z()`

### Stencil Intrinsics
- `pos.idx()` - Linear index
- `pos.north(buf)`, `pos.south(buf)`, `pos.east(buf)`, `pos.west(buf)`
- `pos.at(buf, dx, dy)` - Relative offset access

### Synchronization
- `sync_threads()` - Block-level barrier
- `thread_fence()` - Device memory fence
- `thread_fence_block()` - Block memory fence

### Atomics
- `atomic_add(ptr, val)`, `atomic_sub(ptr, val)`
- `atomic_min(ptr, val)`, `atomic_max(ptr, val)`
- `atomic_exchange(ptr, val)`, `atomic_cas(ptr, compare, val)`

### Math Functions
- `sqrt()`, `abs()`, `floor()`, `ceil()`, `round()`
- `sin()`, `cos()`, `tan()`, `exp()`, `log()`
- `powf()`, `min()`, `max()`, `mul_add()`

### Warp Operations
- `warp_shuffle(val, lane)`, `warp_shuffle_up(val, delta)`
- `warp_shuffle_down(val, delta)`, `warp_shuffle_xor(val, mask)`
- `warp_ballot(pred)`, `warp_all(pred)`, `warp_any(pred)`

## Type Mapping

| Rust Type | CUDA Type |
|-----------|-----------|
| `f32` | `float` |
| `f64` | `double` |
| `i32` | `int` |
| `u32` | `unsigned int` |
| `i64` | `long long` |
| `u64` | `unsigned long long` |
| `bool` | `int` |
| `&[T]` | `const T* __restrict__` |
| `&mut [T]` | `T* __restrict__` |

## Testing

```bash
cargo test -p ringkernel-cuda-codegen
```

The crate includes 143 tests covering all kernel types and language features.

## License

Apache-2.0