ringkernel-cuda-codegen
Rust-to-CUDA transpiler for RingKernel GPU kernels.
Overview
This crate enables writing GPU kernels in a restricted Rust DSL and transpiling them to CUDA C code. It supports three kernel types:
- Global Kernels - Standard CUDA
__global__functions - Stencil Kernels - Tile-based kernels with
GridPosabstraction (2D and 3D) - Ring Kernels - Persistent actor kernels with message loops
Installation
[]
= "0.1"
= { = "2.0", = ["full"] }
Global Kernels
For general-purpose CUDA kernels:
use transpile_global_kernel;
use parse_quote;
let func: ItemFn = parse_quote! ;
let cuda_code = transpile_global_kernel?;
Stencil Kernels
For grid-based computations with neighbor access (2D and 3D):
use ;
// 2D stencil
let func: ItemFn = parse_quote! ;
let config = new
.with_grid
.with_tile_size
.with_halo;
let cuda_code = transpile_stencil_kernel?;
// 3D stencil with up/down neighbors
let func_3d: ItemFn = parse_quote! ;
let config_3d = new
.with_grid
.with_tile_size
.with_halo;
Ring Kernels
For persistent actor-model kernels:
use ;
let handler: ItemFn = parse_quote! ;
let config = new
.with_block_size
.with_queue_capacity
.with_hlc // Hybrid Logical Clocks
.with_k2k; // Kernel-to-kernel messaging
let cuda_code = transpile_ring_kernel?;
DSL Reference
Thread/Block Indices
thread_idx_x(),thread_idx_y(),thread_idx_z()→threadIdx.x/y/zblock_idx_x(),block_idx_y(),block_idx_z()→blockIdx.x/y/zblock_dim_x(),block_dim_y(),block_dim_z()→blockDim.x/y/zgrid_dim_x(),grid_dim_y(),grid_dim_z()→gridDim.x/y/zwarp_size()→warpSize
Stencil Intrinsics (2D)
pos.idx()- Linear indexpos.north(buf),pos.south(buf)- Y-axis neighborspos.east(buf),pos.west(buf)- X-axis neighborspos.at(buf, dx, dy)- Relative offset access
Stencil Intrinsics (3D)
pos.up(buf),pos.down(buf)- Z-axis neighborspos.at(buf, dx, dy, dz)- 3D relative offset access
Synchronization
sync_threads()→__syncthreads()- Block-level barriersync_threads_count(pred)→__syncthreads_count()- Count threads with predicatesync_threads_and(pred)→__syncthreads_and()- AND of predicatesync_threads_or(pred)→__syncthreads_or()- OR of predicatethread_fence()→__threadfence()- Device memory fencethread_fence_block()→__threadfence_block()- Block memory fencethread_fence_system()→__threadfence_system()- System memory fence
Atomic Operations (Integer)
atomic_add(ptr, val)→atomicAddatomic_sub(ptr, val)→atomicSubatomic_min(ptr, val)→atomicMinatomic_max(ptr, val)→atomicMaxatomic_exchange(ptr, val)→atomicExchatomic_cas(ptr, compare, val)→atomicCASatomic_and(ptr, val)→atomicAndatomic_or(ptr, val)→atomicOratomic_xor(ptr, val)→atomicXoratomic_inc(ptr, val)→atomicInc(increment with wrap)atomic_dec(ptr, val)→atomicDec(decrement with wrap)
Basic Math Functions
sqrt(),rsqrt()- Square root, reciprocal sqrtabs(),fabs()- Absolute valuefloor(),ceil(),round(),trunc()- Roundingfma(),mul_add()- Fused multiply-addfmin(),fmax()- Minimum, maximumfmod(),remainder()- Modulo operationscopysign()- Copy signcbrt()- Cube roothypot()- Hypotenuse
Trigonometric Functions
sin(),cos(),tan()- Basic trigasin(),acos(),atan(),atan2()- Inverse trigsincos()- Combined sine and cosinesinpi(),cospi()- Sin/cos of π*x
Hyperbolic Functions
sinh(),cosh(),tanh()- Hyperbolicasinh(),acosh(),atanh()- Inverse hyperbolic
Exponential and Logarithmic Functions
exp(),exp2(),exp10(),expm1()- Exponentialslog(),ln(),log2(),log10(),log1p()- Logarithmspow(),powf(),powi()- Powerldexp(),scalbn()- Load/scale exponentilogb()- Extract exponenterf(),erfc(),erfinv(),erfcinv()- Error functionslgamma(),tgamma()- Gamma functions
Classification Functions
is_nan(),isnan()→isnanis_infinite(),isinf()→isinfis_finite(),isfinite()→isfiniteis_normal(),isnormal()→isnormalsignbit()- Check sign bitnextafter()- Next representable valuefdim()- Positive difference
Warp Operations
warp_active_mask()→__activemask()- Active lane maskwarp_shfl(mask, val, lane)→__shfl_sync- Shufflewarp_shfl_up(mask, val, delta)→__shfl_up_syncwarp_shfl_down(mask, val, delta)→__shfl_down_syncwarp_shfl_xor(mask, val, lane_mask)→__shfl_xor_syncwarp_ballot(mask, pred)→__ballot_syncwarp_all(mask, pred)→__all_syncwarp_any(mask, pred)→__any_sync
Warp Match Operations (Volta+)
warp_match_any(mask, val)→__match_any_syncwarp_match_all(mask, val)→__match_all_sync
Warp Reduce Operations (SM 8.0+)
warp_reduce_add(mask, val)→__reduce_add_syncwarp_reduce_min(mask, val)→__reduce_min_syncwarp_reduce_max(mask, val)→__reduce_max_syncwarp_reduce_and(mask, val)→__reduce_and_syncwarp_reduce_or(mask, val)→__reduce_or_syncwarp_reduce_xor(mask, val)→__reduce_xor_sync
Bit Manipulation
popc(),popcount(),count_ones()→__popc- Population countclz(),leading_zeros()→__clz- Count leading zerosctz(),trailing_zeros()→__ffs - 1- Count trailing zerosffs()→__ffs- Find first setbrev(),reverse_bits()→__brev- Bit reversebyte_perm()→__byte_perm- Byte permutationfunnel_shift_left()→__funnelshift_lfunnel_shift_right()→__funnelshift_r
Memory Operations
ldg(ptr),load_global(ptr)→__ldg- Read-only cache loadprefetch_l1(ptr)→__prefetch_l1- L1 prefetchprefetch_l2(ptr)→__prefetch_l2- L2 prefetch
Special Functions
rcp(),recip()→__frcp_rn- Fast reciprocalfast_div()→__fdividef- Fast divisionsaturate(),clamp_01()→__saturatef- Saturate to [0,1]j0(),j1(),jn()- Bessel functions of first kindy0(),y1(),yn()- Bessel functions of second kindnormcdf(),normcdfinv()- Normal CDFcyl_bessel_i0(),cyl_bessel_i1()- Cylindrical Bessel functions
Clock and Timing
clock()→clock()- 32-bit clock counterclock64()→clock64()- 64-bit clock counternanosleep(ns)→__nanosleep- Sleep for nanoseconds
RingContext Methods
ctx.thread_id()→threadIdx.xctx.block_id()→blockIdx.xctx.global_thread_id()→(blockIdx.x * blockDim.x + threadIdx.x)ctx.sync_threads()→__syncthreads()ctx.lane_id()→(threadIdx.x % 32)ctx.warp_id()→(threadIdx.x / 32)
Ring Kernel Intrinsics
is_active(),should_terminate(),mark_terminated()messages_processed(),input_queue_size(),output_queue_size()input_queue_empty(),output_queue_empty(),enqueue_response(&resp)hlc_tick(),hlc_update(ts),hlc_now()- HLC operationsk2k_send(target, &msg),k2k_try_recv()- K2K messagingk2k_has_message(),k2k_peek(),k2k_pending_count()
Type Mapping
| Rust Type | CUDA Type |
|---|---|
f32 |
float |
f64 |
double |
i32 |
int |
u32 |
unsigned int |
i64 |
long long |
u64 |
unsigned long long |
bool |
int |
&[T] |
const T* __restrict__ |
&mut [T] |
T* __restrict__ |
Intrinsic Count
The transpiler supports 120+ GPU intrinsics across 13 categories:
| Category | Count | Examples |
|---|---|---|
| Synchronization | 7 | sync_threads, thread_fence |
| Atomics | 11 | atomic_add, atomic_cas, atomic_and |
| Math | 16 | sqrt, fma, cbrt, hypot |
| Trigonometric | 11 | sin, asin, atan2, sincos |
| Hyperbolic | 6 | sinh, asinh |
| Exponential | 18 | exp, log2, erf, gamma |
| Classification | 8 | isnan, isfinite, signbit |
| Warp | 16 | warp_shfl, warp_reduce_add, warp_match_any |
| Bit Manipulation | 8 | popc, clz, brev, funnel_shift_left |
| Memory | 3 | ldg, prefetch_l1 |
| Special | 13 | rcp, saturate, normcdf |
| Index | 13 | thread_idx_x, warp_size |
| Timing | 3 | clock, clock64, nanosleep |
Testing
The crate includes 171 tests covering all kernel types, intrinsics, and language features.
License
Apache-2.0