ringkernel-cuda-codegen

Rust-to-CUDA transpiler for RingKernel GPU kernels.

Overview

This crate lets you write GPU kernels in a restricted Rust DSL and transpiles them to CUDA C code. It supports three kernel types:

Global Kernels - Standard CUDA __global__ functions
Stencil Kernels - Tile-based kernels with GridPos abstraction (2D and 3D)
Ring Kernels - Persistent actor kernels with message loops

Installation

[dependencies]
ringkernel-cuda-codegen = "0.2"
syn = { version = "2.0", features = ["full"] }

Global Kernels

For general-purpose CUDA kernels:

use ringkernel_cuda_codegen::transpile_global_kernel;
use syn::parse_quote;

let func: syn::ItemFn = parse_quote! {
    fn saxpy(x: &[f32], y: &mut [f32], a: f32, n: i32) {
        let idx = block_idx_x() * block_dim_x() + thread_idx_x();
        if idx >= n { return; }
        y[idx as usize] = a * x[idx as usize] + y[idx as usize];
    }
};

let cuda_code = transpile_global_kernel(&func)?;
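To make the index arithmetic in the saxpy kernel concrete, here is a minimal CPU sketch of the same computation. It is not part of this crate: the function name `saxpy_reference` and the nested loops that stand in for the block/thread launch grid are illustrative assumptions.

```rust
/// CPU model of a 1D CUDA launch (hypothetical helper, not a crate API).
/// Every (block, thread) pair runs the same body and derives its element
/// index the way the `saxpy` kernel does.
fn saxpy_reference(x: &[f32], y: &mut [f32], a: f32, block_dim: usize) {
    assert!(block_dim > 0, "block_dim must be non-zero");
    let n = x.len().min(y.len());
    // Round up so the last, partially filled block is still launched.
    let grid_dim = (n + block_dim - 1) / block_dim;
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            // Mirrors block_idx_x() * block_dim_x() + thread_idx_x().
            let idx = block_idx * block_dim + thread_idx;
            // Same bounds guard as `if idx >= n { return; }` in the kernel.
            if idx >= n {
                continue;
            }
            y[idx] = a * x[idx] + y[idx];
        }
    }
}
```

With five elements and a block size of four, two blocks are launched and the three surplus threads of the second block are rejected by the guard.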

Stencil Kernels

For grid-based computations with neighbor access (2D and 3D):

use ringkernel_cuda_codegen::{transpile_stencil_kernel, StencilConfig, Grid};
use syn::parse_quote;

// 2D stencil
let func: syn::ItemFn = parse_quote! {
    fn fdtd(p: &[f32], p_prev: &mut [f32], c2: f32, pos: GridPos) {
        let lap = pos.north(p) + pos.south(p) + pos.east(p) + pos.west(p)
                  - 4.0 * p[pos.idx()];
        p_prev[pos.idx()] = 2.0 * p[pos.idx()] - p_prev[pos.idx()] + c2 * lap;
    }
};

let config = StencilConfig::new("fdtd")
    .with_grid(Grid::Grid2D)
    .with_tile_size(16, 16)
    .with_halo(1);

let cuda_code = transpile_stencil_kernel(&func, &config)?;

// 3D stencil with up/down neighbors
let func_3d: syn::ItemFn = parse_quote! {
    fn laplacian_3d(p: &[f32], out: &mut [f32], pos: GridPos) {
        let lap = pos.north(p) + pos.south(p) + pos.east(p) + pos.west(p)
                  + pos.up(p) + pos.down(p) - 6.0 * p[pos.idx()];
        out[pos.idx()] = lap;
    }
};

let config_3d = StencilConfig::new("laplacian")
    .with_grid(Grid::Grid3D)
    .with_tile_size(8, 8)
    .with_halo(1);

let cuda_code_3d = transpile_stencil_kernel(&func_3d, &config_3d)?;

Ring Kernels

For persistent actor-model kernels:

use ringkernel_cuda_codegen::{transpile_ring_kernel, RingKernelConfig};
use syn::parse_quote;

let handler: syn::ItemFn = parse_quote! {
    fn process(ctx: &RingContext, msg: &Request) -> Response {
        let tid = ctx.global_thread_id();
        ctx.sync_threads();
        Response { value: msg.value * 2.0, id: tid as u64 }
    }
};

let config = RingKernelConfig::new("processor")
    .with_block_size(128)
    .with_queue_capacity(1024)
    .with_hlc(true)           // Hybrid Logical Clocks
    .with_k2k(true)           // Kernel-to-kernel messaging
    .with_envelope_format(true)  // MessageEnvelope serialization
    .with_kernel_id(1)        // Kernel ID for routing
    .with_hlc_node_id(1);     // HLC node ID

let cuda_code = transpile_ring_kernel(&handler, &config)?;
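The persistent message loop that wraps a handler can be pictured with a small host-side model. This is a sketch under stated assumptions, not the generated CUDA: `RingLoop`, its fields, and the closure signature are hypothetical, and a real persistent kernel would keep polling instead of returning when the queue drains.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
struct Request {
    value: f32,
}

#[derive(Debug, Clone, PartialEq)]
struct Response {
    value: f32,
    id: u64,
}

/// Hypothetical host-side stand-in for a persistent ring kernel.
struct RingLoop {
    input: VecDeque<Request>,
    output: VecDeque<Response>,
    terminate: bool,
    processed: u64,
}

impl RingLoop {
    /// Drains the input queue, calling `handler(thread_id, msg)` per message.
    fn run(&mut self, handler: impl Fn(u64, &Request) -> Response) {
        while !self.terminate {
            match self.input.pop_front() {
                Some(msg) => {
                    // A single logical thread (id 0) handles every message here.
                    let resp = handler(0, &msg);
                    self.output.push_back(resp);
                    self.processed += 1;
                }
                // The sketch stops on an empty queue; a GPU loop would spin.
                None => break,
            }
        }
    }
}
```

Passing a closure such as `|tid, msg| Response { value: msg.value * 2.0, id: tid }` reproduces the behaviour of the `process` handler above for a single thread.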

Envelope Format

When with_envelope_format(true) is set, messages use a standardized MessageEnvelope format:

// MessageHeader (256 bytes, cache-aligned)
typedef struct __align__(256) {
    uint32_t magic;           // 0xCAFEBABE
    uint32_t version;         // Protocol version
    uint64_t type_id;         // Message type identifier
    uint64_t envelope_id;     // Unique envelope ID
    uint64_t correlation_id;  // Request/response correlation
    uint64_t source_kernel;   // Source kernel ID
    uint64_t target_kernel;   // Target kernel ID
    uint64_t hlc_wall;        // HLC wall clock
    uint64_t hlc_logical;     // HLC logical counter
    uint32_t hlc_node;        // HLC node ID
    uint32_t priority;        // Message priority
    uint32_t payload_size;    // Payload size in bytes
    uint32_t flags;           // Message flags
    uint8_t reserved[168];    // Explicit padding up to byte 248; __align__(256) rounds sizeof to 256
} MessageHeader;

// MessageEnvelope = header + payload

This enables:

Automatic HLC timestamp propagation across kernels
Kernel-to-kernel routing with source/target tracking
Correlation ID for request/response patterns
Priority-based message ordering
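If host code needs to read or build these headers, a Rust mirror of the struct might look like the sketch below. It is an assumption for illustration, not a type exported by this crate: the field order and widths are copied from the C definition above, and `header_layout` and `EXPLICIT_BYTES` are hypothetical names used to check the layout.

```rust
use std::mem::{align_of, size_of};

/// Magic value from the C definition above.
const MAGIC: u32 = 0xCAFE_BABE;

/// Bytes covered by named fields: 2 x u32, 7 x u64, 4 x u32, 168 reserved.
const EXPLICIT_BYTES: usize = 2 * 4 + 7 * 8 + 4 * 4 + 168;

/// Hypothetical host-side mirror of the CUDA `MessageHeader`.
#[allow(dead_code)]
#[repr(C, align(256))]
#[derive(Clone, Copy)]
struct MessageHeader {
    magic: u32,
    version: u32,
    type_id: u64,
    envelope_id: u64,
    correlation_id: u64,
    source_kernel: u64,
    target_kernel: u64,
    hlc_wall: u64,
    hlc_logical: u64,
    hlc_node: u32,
    priority: u32,
    payload_size: u32,
    flags: u32,
    reserved: [u8; 168],
}

/// Returns (size, alignment) of the mirrored header in bytes.
fn header_layout() -> (usize, usize) {
    (size_of::<MessageHeader>(), align_of::<MessageHeader>())
}
```

The named fields add up to 248 bytes; the 256-byte alignment is what brings the total size to 256, on both the Rust and the CUDA side.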

DSL Reference

Thread/Block Indices

thread_idx_x(), thread_idx_y(), thread_idx_z() → threadIdx.x/y/z
block_idx_x(), block_idx_y(), block_idx_z() → blockIdx.x/y/z
block_dim_x(), block_dim_y(), block_dim_z() → blockDim.x/y/z
grid_dim_x(), grid_dim_y(), grid_dim_z() → gridDim.x/y/z
warp_size() → warpSize

Stencil Intrinsics (2D)

pos.idx() - Linear index
pos.north(buf), pos.south(buf) - Y-axis neighbors
pos.east(buf), pos.west(buf) - X-axis neighbors
pos.at(buf, dx, dy) - Relative offset access

Stencil Intrinsics (3D)

pos.up(buf), pos.down(buf) - Z-axis neighbors
pos.at(buf, dx, dy, dz) - 3D relative offset access
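The neighbor accessors can be understood through a plain CPU reference of the 2D five-point Laplacian. This sketch assumes a row-major buffer with a one-cell halo and that north/south step along Y while east/west step along X; `Grid2D` and `laplacian_2d` are hypothetical names, and the layout the transpiler actually generates may differ.

```rust
/// Row-major 2D buffer with a one-cell halo on every side (assumed layout).
struct Grid2D {
    width: usize,   // interior width
    height: usize,  // interior height
    data: Vec<f32>, // (width + 2) * (height + 2) cells
}

impl Grid2D {
    fn new(width: usize, height: usize) -> Self {
        Self { width, height, data: vec![0.0; (width + 2) * (height + 2)] }
    }

    /// Linear index of interior cell (x, y), skipping the halo.
    fn idx(&self, x: usize, y: usize) -> usize {
        (y + 1) * (self.width + 2) + (x + 1)
    }
}

/// Writes the five-point Laplacian of `p` into the interior of `out`.
fn laplacian_2d(p: &Grid2D, out: &mut Grid2D) {
    assert_eq!((p.width, p.height), (out.width, out.height));
    let stride = p.width + 2;
    for y in 0..p.height {
        for x in 0..p.width {
            let i = p.idx(x, y);
            // The halo guarantees these neighbor reads stay in bounds.
            let north = p.data[i - stride];
            let south = p.data[i + stride];
            let west = p.data[i - 1];
            let east = p.data[i + 1];
            out.data[i] = north + south + east + west - 4.0 * p.data[i];
        }
    }
}
```

Because the stencil is symmetric, the result does not depend on whether north is taken to mean decreasing or increasing Y.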

Synchronization

sync_threads() → __syncthreads() - Block-level barrier
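sync_threads_count(pred) → __syncthreads_count() - Barrier; returns the number of threads whose predicate is non-zero
sync_threads_and(pred) → __syncthreads_and() - Barrier; non-zero if the predicate is non-zero for all threads
sync_threads_or(pred) → __syncthreads_or() - Barrier; non-zero if the predicate is non-zero for any thread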
thread_fence() → __threadfence() - Device memory fence
thread_fence_block() → __threadfence_block() - Block memory fence
thread_fence_system() → __threadfence_system() - System memory fence

Atomic Operations (Integer)

atomic_add(ptr, val) → atomicAdd
atomic_sub(ptr, val) → atomicSub
atomic_min(ptr, val) → atomicMin
atomic_max(ptr, val) → atomicMax
atomic_exchange(ptr, val) → atomicExch
atomic_cas(ptr, compare, val) → atomicCAS
atomic_and(ptr, val) → atomicAnd
atomic_or(ptr, val) → atomicOr
atomic_xor(ptr, val) → atomicXor
atomic_inc(ptr, val) → atomicInc (increment; wraps to 0 once the old value is >= val)
atomic_dec(ptr, val) → atomicDec (decrement; wraps to val when the old value is 0 or > val)
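The wrapping rules of atomicInc and atomicDec are easy to misread, so here is a sequential model of the update each one performs, following the CUDA definitions. The function names are hypothetical and the sketch is not atomic; it only shows the arithmetic.

```rust
/// Sequential model of CUDA `atomicInc`: returns the old value and stores
/// 0 if `old >= limit`, otherwise `old + 1`.
fn atomic_inc_model(cell: &mut u32, limit: u32) -> u32 {
    let old = *cell;
    *cell = if old >= limit { 0 } else { old + 1 };
    old
}

/// Sequential model of CUDA `atomicDec`: returns the old value and stores
/// `limit` if `old == 0 || old > limit`, otherwise `old - 1`.
fn atomic_dec_model(cell: &mut u32, limit: u32) -> u32 {
    let old = *cell;
    *cell = if old == 0 || old > limit { limit } else { old - 1 };
    old
}
```

With a limit of 2, repeated increments cycle through 0, 1, 2, 0, which makes these operations a natural fit for ring-buffer cursors.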

Basic Math Functions

sqrt(), rsqrt() - Square root, reciprocal sqrt
abs(), fabs() - Absolute value
floor(), ceil(), round(), trunc() - Rounding
fma(), mul_add() - Fused multiply-add
fmin(), fmax() - Minimum, maximum
fmod(), remainder() - Modulo operations
copysign() - Copy sign
cbrt() - Cube root
hypot() - Hypotenuse

Trigonometric Functions

sin(), cos(), tan() - Basic trig
asin(), acos(), atan(), atan2() - Inverse trig
sincos() - Combined sine and cosine
sinpi(), cospi() - Sin/cos of π*x

Hyperbolic Functions

sinh(), cosh(), tanh() - Hyperbolic
asinh(), acosh(), atanh() - Inverse hyperbolic

Exponential and Logarithmic Functions

exp(), exp2(), exp10(), expm1() - Exponentials
log(), ln(), log2(), log10(), log1p() - Logarithms
pow(), powf(), powi() - Power
ldexp(), scalbn() - Load/scale exponent
ilogb() - Extract exponent
erf(), erfc(), erfinv(), erfcinv() - Error functions
lgamma(), tgamma() - Gamma functions

Classification Functions

is_nan(), isnan() → isnan
is_infinite(), isinf() → isinf
is_finite(), isfinite() → isfinite
is_normal(), isnormal() → isnormal
signbit() - Check sign bit
nextafter() - Next representable value
fdim() - Positive difference

Warp Operations

warp_active_mask() → __activemask() - Active lane mask
warp_shfl(mask, val, lane) → __shfl_sync - Shuffle
warp_shfl_up(mask, val, delta) → __shfl_up_sync
warp_shfl_down(mask, val, delta) → __shfl_down_sync
warp_shfl_xor(mask, val, lane_mask) → __shfl_xor_sync
warp_ballot(mask, pred) → __ballot_sync
warp_all(mask, pred) → __all_sync
warp_any(mask, pred) → __any_sync
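A common use of the shuffle intrinsics is a warp-wide tree reduction. The sketch below models it on the CPU for a full 32-lane warp; `shfl_down_model` and `warp_sum_model` are hypothetical names, and the model assumes every lane is active (a full mask).

```rust
const WARP_SIZE: usize = 32;

/// CPU model of `__shfl_down_sync` with a full mask: lane `i` reads the value
/// held by lane `i + delta`, or keeps its own value if that lane is out of range.
fn shfl_down_model(vals: &[i32; WARP_SIZE], delta: usize) -> [i32; WARP_SIZE] {
    let mut out = *vals;
    for lane in 0..WARP_SIZE {
        let src = lane + delta;
        if src < WARP_SIZE {
            out[lane] = vals[src];
        }
    }
    out
}

/// Tree reduction with offsets 16, 8, 4, 2, 1; lane 0 ends up with the sum.
fn warp_sum_model(mut vals: [i32; WARP_SIZE]) -> i32 {
    let mut delta = WARP_SIZE / 2;
    while delta > 0 {
        let shifted = shfl_down_model(&vals, delta);
        for lane in 0..WARP_SIZE {
            vals[lane] += shifted[lane];
        }
        delta /= 2;
    }
    // Only lane 0 is guaranteed to hold the complete sum.
    vals[0]
}
```

In the DSL, the same pattern would presumably be written as a loop over `warp_shfl_down(mask, val, delta)` with halving `delta`.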

Warp Match Operations (Volta+)

warp_match_any(mask, val) → __match_any_sync
warp_match_all(mask, val) → __match_all_sync

Warp Reduce Operations (SM 8.0+)

warp_reduce_add(mask, val) → __reduce_add_sync
warp_reduce_min(mask, val) → __reduce_min_sync
warp_reduce_max(mask, val) → __reduce_max_sync
warp_reduce_and(mask, val) → __reduce_and_sync
warp_reduce_or(mask, val) → __reduce_or_sync
warp_reduce_xor(mask, val) → __reduce_xor_sync

Bit Manipulation

popc(), popcount(), count_ones() → __popc - Population count
clz(), leading_zeros() → __clz - Count leading zeros
ctz(), trailing_zeros() → (__ffs(x) - 1) - Count trailing zeros (yields -1 for x == 0, since __ffs(0) is 0)
ffs() → __ffs - Find first set
brev(), reverse_bits() → __brev - Bit reverse
byte_perm() → __byte_perm - Byte permutation
funnel_shift_left() → __funnelshift_l
funnel_shift_right() → __funnelshift_r
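Because trailing-zero counts are lowered through `__ffs`, the zero input behaves differently from Rust's own `trailing_zeros`. The following sketch models both on the CPU; `ffs_model` and `ctz_via_ffs` are hypothetical names that follow the CUDA definition of `__ffs` (1-based position of the least significant set bit, 0 when no bit is set).

```rust
/// CPU model of CUDA `__ffs`: 1-based index of the lowest set bit, 0 for x == 0.
fn ffs_model(x: u32) -> i32 {
    if x == 0 {
        0
    } else {
        x.trailing_zeros() as i32 + 1
    }
}

/// Trailing-zero count expressed as `__ffs(x) - 1`.
fn ctz_via_ffs(x: u32) -> i32 {
    ffs_model(x) - 1
}
```

For non-zero inputs the two agree; for zero, `ctz_via_ffs(0)` is -1 while Rust's `0u32.trailing_zeros()` is 32, so kernels that may see a zero should guard that case explicitly.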

Memory Operations

ldg(ptr), load_global(ptr) → __ldg - Read-only cache load
prefetch_l1(ptr) → __prefetch_l1 - L1 prefetch
prefetch_l2(ptr) → __prefetch_l2 - L2 prefetch

Special Functions

rcp(), recip() → __frcp_rn - Fast reciprocal
fast_div() → __fdividef - Fast division
saturate(), clamp_01() → __saturatef - Saturate to [0,1]
j0(), j1(), jn() - Bessel functions of first kind
y0(), y1(), yn() - Bessel functions of second kind
normcdf(), normcdfinv() - Normal CDF
cyl_bessel_i0(), cyl_bessel_i1() - Cylindrical Bessel functions

Clock and Timing

clock() → clock() - 32-bit clock counter
clock64() → clock64() - 64-bit clock counter
nanosleep(ns) → __nanosleep - Sleep for nanoseconds

RingContext Methods

ctx.thread_id() → threadIdx.x
ctx.block_id() → blockIdx.x
ctx.global_thread_id() → (blockIdx.x * blockDim.x + threadIdx.x)
ctx.sync_threads() → __syncthreads()
ctx.lane_id() → (threadIdx.x % 32)
ctx.warp_id() → (threadIdx.x / 32)
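The expressions these methods expand to can be checked with a small CPU model. `ThreadCoords` is a hypothetical struct holding the X components of the CUDA built-ins; the arithmetic is taken from the mappings listed above and assumes a warp size of 32.

```rust
/// Hypothetical container for the X components of threadIdx, blockIdx, blockDim.
struct ThreadCoords {
    thread_idx_x: u32,
    block_idx_x: u32,
    block_dim_x: u32,
}

impl ThreadCoords {
    /// blockIdx.x * blockDim.x + threadIdx.x
    fn global_thread_id(&self) -> u32 {
        self.block_idx_x * self.block_dim_x + self.thread_idx_x
    }

    /// threadIdx.x % 32: position of the thread within its warp.
    fn lane_id(&self) -> u32 {
        self.thread_idx_x % 32
    }

    /// threadIdx.x / 32: index of the warp within the block.
    fn warp_id(&self) -> u32 {
        self.thread_idx_x / 32
    }
}
```

For thread 70 of block 3 with a block size of 128, this gives global id 454, lane 6, and warp 2.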

Ring Kernel Intrinsics

is_active(), should_terminate(), mark_terminated()
messages_processed(), input_queue_size(), output_queue_size()
input_queue_empty(), output_queue_empty(), enqueue_response(&resp)
hlc_tick(), hlc_update(ts), hlc_now() - HLC operations
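For readers unfamiliar with Hybrid Logical Clocks, the sketch below shows the textbook tick and update rules on the CPU. It is not taken from the generated CUDA: the `Hlc` type, the explicit `physical_now` parameter, and the method signatures are assumptions for illustration, and the crate's intrinsics may differ in detail.

```rust
/// Timestamp ordered by wall clock first, then logical counter.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct HlcTimestamp {
    wall: u64,
    logical: u64,
}

/// Hypothetical CPU model of a Hybrid Logical Clock.
struct Hlc {
    last: HlcTimestamp,
}

impl Hlc {
    fn new() -> Self {
        Self { last: HlcTimestamp { wall: 0, logical: 0 } }
    }

    /// Local or send event: adopt the physical clock if it moved forward,
    /// otherwise bump the logical counter.
    fn tick(&mut self, physical_now: u64) -> HlcTimestamp {
        if physical_now > self.last.wall {
            self.last = HlcTimestamp { wall: physical_now, logical: 0 };
        } else {
            self.last.logical += 1;
        }
        self.last
    }

    /// Receive event: merge a remote timestamp so the result is strictly
    /// greater than both the local and the remote timestamp.
    fn update(&mut self, remote: HlcTimestamp, physical_now: u64) -> HlcTimestamp {
        let wall = self.last.wall.max(remote.wall).max(physical_now);
        let logical = if wall == self.last.wall && wall == remote.wall {
            self.last.logical.max(remote.logical) + 1
        } else if wall == self.last.wall {
            self.last.logical + 1
        } else if wall == remote.wall {
            remote.logical + 1
        } else {
            0
        };
        self.last = HlcTimestamp { wall, logical };
        self.last
    }
}
```

The key property is that timestamps never go backwards even when the physical clock stalls or a message arrives from a node whose clock is ahead.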

K2K Messaging (Envelope-Aware)

k2k_send(target, &msg) - Send message to target kernel
k2k_send_envelope(&envelope) - Send full envelope with routing
k2k_try_recv() - Non-blocking receive
k2k_try_recv_envelope() - Receive with envelope metadata
k2k_has_message(), k2k_peek(), k2k_pending_count()
k2k_get_source_kernel() - Get source kernel ID from envelope
k2k_get_correlation_id() - Get correlation ID for request/response

Type Mapping

Rust Type	CUDA Type
`f32`	`float`
`f64`	`double`
`i32`	`int`
`u32`	`unsigned int`
`i64`	`long long`
`u64`	`unsigned long long`
`bool`	`int`
`&[T]`	`const T* __restrict__`
`&mut [T]`	`T* __restrict__`
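The table can be read as a simple lookup. The sketch below encodes it as string matching purely for illustration; `cuda_scalar` and `cuda_param_type` are hypothetical helpers, and the transpiler itself works on `syn` types, not strings.

```rust
/// Scalar rows of the mapping table (hypothetical helper).
fn cuda_scalar(rust_ty: &str) -> Option<&'static str> {
    match rust_ty {
        "f32" => Some("float"),
        "f64" => Some("double"),
        "i32" => Some("int"),
        "u32" => Some("unsigned int"),
        "i64" => Some("long long"),
        "u64" => Some("unsigned long long"),
        "bool" => Some("int"),
        _ => None,
    }
}

/// Maps a Rust parameter type, written as text, to its CUDA counterpart.
/// Returns `None` for anything outside the table.
fn cuda_param_type(rust_ty: &str) -> Option<String> {
    let ty = rust_ty.trim();
    // `&mut [T]` becomes a writable restrict pointer.
    if let Some(inner) = ty.strip_prefix("&mut [").and_then(|s| s.strip_suffix(']')) {
        return cuda_scalar(inner).map(|t| format!("{}* __restrict__", t));
    }
    // `&[T]` becomes a read-only restrict pointer.
    if let Some(inner) = ty.strip_prefix("&[").and_then(|s| s.strip_suffix(']')) {
        return cuda_scalar(inner).map(|t| format!("const {}* __restrict__", t));
    }
    cuda_scalar(ty).map(String::from)
}
```

Note that slices carry no length on the CUDA side, which is why kernels such as saxpy take an explicit `n` parameter.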

Intrinsic Count

The transpiler supports 120+ GPU intrinsics across 13 categories:

Category	Count	Examples
Synchronization	7	`sync_threads`, `thread_fence`
Atomics	11	`atomic_add`, `atomic_cas`, `atomic_and`
Math	16	`sqrt`, `fma`, `cbrt`, `hypot`
Trigonometric	11	`sin`, `asin`, `atan2`, `sincos`
Hyperbolic	6	`sinh`, `asinh`
Exponential	18	`exp`, `log2`, `erf`, `gamma`
Classification	8	`isnan`, `isfinite`, `signbit`
Warp	16	`warp_shfl`, `warp_reduce_add`, `warp_match_any`
Bit Manipulation	8	`popc`, `clz`, `brev`, `funnel_shift_left`
Memory	3	`ldg`, `prefetch_l1`
Special	13	`rcp`, `saturate`, `normcdf`
Index	13	`thread_idx_x`, `warp_size`
Timing	3	`clock`, `clock64`, `nanosleep`

Testing

cargo test -p ringkernel-cuda-codegen

The crate includes 183 tests covering all kernel types, intrinsics, envelope format, and language features.

License

Apache-2.0

ringkernel-cuda-codegen 0.4.0