Expand description
Rust DSL functions for writing CUDA kernels.
This module provides Rust functions that map to CUDA intrinsics during transpilation. These functions have CPU fallback implementations for testing but are transpiled to the corresponding CUDA operations when used in kernel code.
§Thread/Block Index Access
ⓘ
use ringkernel_cuda_codegen::dsl::*;
fn my_kernel(...) {
let tx = thread_idx_x(); // -> threadIdx.x
let bx = block_idx_x(); // -> blockIdx.x
let idx = bx * block_dim_x() + tx; // Global thread index
}§Thread Synchronization
ⓘ
sync_threads(); // -> __syncthreads()§Math Functions
All standard math functions are available with CPU fallbacks:
- Trigonometric: sin, cos, tan, asin, acos, atan, atan2
- Hyperbolic: sinh, cosh, tanh, asinh, acosh, atanh
- Exponential: exp, exp2, exp10, expm1, log, log2, log10, log1p
- Power: pow, sqrt, rsqrt, cbrt
- Rounding: floor, ceil, round, trunc
- Comparison: fmin, fmax, fdim, copysign
§Warp Operations
ⓘ
let mask = warp_active_mask(); // Get active lane mask
let result = warp_reduce_add(mask, value); // Warp-level sum
let shuffled = warp_shfl(mask, value, lane); // Shuffle§Bit Manipulation
ⓘ
let bits = popc(x); // Count set bits
let zeros = clz(x); // Count leading zeros
let rev = brev(x); // Reverse bitsFunctions§
- acos
- Arccosine. Transpiles to:
acosf(x) - acosh
- Inverse hyperbolic cosine. Transpiles to:
acoshf(x) - asin
- Arcsine. Transpiles to:
asinf(x) - asinh
- Inverse hyperbolic sine. Transpiles to:
asinhf(x) - atan
- Arctangent. Transpiles to:
atanf(x) - atan2
- Two-argument arctangent. Transpiles to:
atan2f(y, x) - atanh
- Inverse hyperbolic tangent. Transpiles to:
atanhf(x) - atomic_
add - Atomic add. Transpiles to:
atomicAdd(addr, val)WARNING: CPU fallback is NOT thread-safe! - atomic_
add_ f32 - Atomic add for f32. Transpiles to:
atomicAdd(addr, val) - atomic_
and - Atomic AND. Transpiles to:
atomicAnd(addr, val) - atomic_
cas - Atomic compare and swap. Transpiles to:
atomicCAS(addr, compare, val) - atomic_
dec - Atomic decrement with wrap. Transpiles to:
atomicDec(addr, val) - atomic_
exchange - Atomic exchange. Transpiles to:
atomicExch(addr, val) - atomic_
inc - Atomic increment with wrap. Transpiles to:
atomicInc(addr, val) - atomic_
max - Atomic maximum. Transpiles to:
atomicMax(addr, val) - atomic_
min - Atomic minimum. Transpiles to:
atomicMin(addr, val) - atomic_
or - Atomic OR. Transpiles to:
atomicOr(addr, val) - atomic_
sub - Atomic subtract. Transpiles to:
atomicSub(addr, val) - atomic_
xor - Atomic XOR. Transpiles to:
atomicXor(addr, val) - block_
dim_ x - Get the block dimension (x dimension).
Transpiles to:
blockDim.x - block_
dim_ y - Get the block dimension (y dimension).
Transpiles to:
blockDim.y - block_
dim_ z - Get the block dimension (z dimension).
Transpiles to:
blockDim.z - block_
idx_ x - Get the block index within a grid (x dimension).
Transpiles to:
blockIdx.x - block_
idx_ y - Get the block index within a grid (y dimension).
Transpiles to:
blockIdx.y - block_
idx_ z - Get the block index within a grid (z dimension).
Transpiles to:
blockIdx.z - brev
- Bit reverse. Transpiles to:
__brev(x) - byte_
perm - Byte permutation. Transpiles to:
__byte_perm(x, y, s) - cbrt
- Cube root. Transpiles to:
cbrtf(x) - ceil
- Ceiling. Transpiles to:
ceilf(x) - clamp_
01 - Clamp to [0, 1] (alias for saturate).
- clock
- Read clock counter. Transpiles to:
clock() - clock64
- Read 64-bit clock counter. Transpiles to:
clock64() - clz
- Count leading zeros. Transpiles to:
__clz(x) - copysign
- Copy sign. Transpiles to:
copysignf(x, y) - cos
- Cosine. Transpiles to:
cosf(x) - cosh
- Hyperbolic cosine. Transpiles to:
coshf(x) - cospi
- Cosine of pi*x. Transpiles to:
cospif(x) - ctz
- Count trailing zeros. Transpiles to:
__ffs(x) - 1 - erf
- Error function. Transpiles to:
erff(x) - erfc
- Complementary error function. Transpiles to:
erfcf(x) - exp
- Exponential (base e). Transpiles to:
expf(x) - exp2
- Exponential (base 2). Transpiles to:
exp2f(x) - exp10
- Exponential (base 10). Transpiles to:
exp10f(x) - expm1
- exp(x) - 1 (accurate for small x). Transpiles to:
expm1f(x) - fabs
- Absolute value for f32. Transpiles to:
fabsf(x) - fast_
div - Fast division. Transpiles to:
__fdividef(x, y) - fdim
- Floating-point difference. Transpiles to:
fdimf(x, y) - ffs
- Find first set bit (1-indexed, 0 if none). Transpiles to:
__ffs(x) - floor
- Floor. Transpiles to:
floorf(x) - fma
- Fused multiply-add. Transpiles to:
fmaf(a, b, c) - fmax
- Maximum. Transpiles to:
fmaxf(a, b) - fmin
- Minimum. Transpiles to:
fminf(a, b) - fmod
- Floating-point modulo. Transpiles to:
fmodf(x, y) - funnel_
shift_ left - Funnel shift left. Transpiles to:
__funnelshift_l(lo, hi, shift) - funnel_
shift_ right - Funnel shift right. Transpiles to:
__funnelshift_r(lo, hi, shift) - grid_
dim_ x - Get the grid dimension (x dimension).
Transpiles to:
gridDim.x - grid_
dim_ y - Get the grid dimension (y dimension).
Transpiles to:
gridDim.y - grid_
dim_ z - Get the grid dimension (z dimension).
Transpiles to:
gridDim.z - hypot
- Hypotenuse. Transpiles to:
hypotf(x, y) - ilogb
- Extract exponent. Transpiles to:
ilogbf(x) - is_
finite - Check if finite. Transpiles to:
isfinite(x) - is_
infinite - Check if infinite. Transpiles to:
isinf(x) - is_nan
- Check if NaN. Transpiles to:
isnan(x) - is_
normal - Check if normal. Transpiles to:
isnormal(x) - ldexp
- Load exponent. Transpiles to:
ldexpf(x, exp) - ldg
- Read-only cache load. Transpiles to:
__ldg(ptr) - leading_
zeros - Count leading zeros (i32 version).
- load_
global - Load from global memory (alias for ldg).
- log
- Natural logarithm (base e). Transpiles to:
logf(x) - log2
- Logarithm (base 2). Transpiles to:
log2f(x) - log1p
- log(1 + x) (accurate for small x). Transpiles to:
log1pf(x) - log10
- Logarithm (base 10). Transpiles to:
log10f(x) - nanosleep
- Nanosleep. Transpiles to:
__nanosleep(ns) - nextafter
- Next representable value. Transpiles to:
nextafterf(x, y) - popc
- Population count (count set bits). Transpiles to:
__popc(x) - popcount
- Population count (i32 version).
- pow
- Power. Transpiles to:
powf(x, y) - prefetch_
l1 - Prefetch to L1 cache. Transpiles to:
__prefetch_l1(ptr) - prefetch_
l2 - Prefetch to L2 cache. Transpiles to:
__prefetch_l2(ptr) - rcp
- Fast reciprocal. Transpiles to:
__frcp_rn(x) - remainder
- Remainder. Transpiles to:
remainderf(x, y) - reverse_
bits - Bit reverse (i32 version).
- round
- Round to nearest. Transpiles to:
roundf(x) - rsqrt
- Reciprocal square root. Transpiles to:
rsqrtf(x) - saturate
- Saturate to [0, 1]. Transpiles to:
__saturatef(x) - scalbn
- Scale by power of 2. Transpiles to:
scalbnf(x, n) - signbit
- Check sign bit. Transpiles to:
signbit(x) - sin
- Sine. Transpiles to:
sinf(x) - sincos
- Sine and cosine together. Transpiles to:
sincosf(x, &s, &c) - sinh
- Hyperbolic sine. Transpiles to:
sinhf(x) - sinpi
- Sine of pi*x. Transpiles to:
sinpif(x) - sqrt
- Square root. Transpiles to:
sqrtf(x) - sync_
threads - Synchronize all threads in a block.
Transpiles to:
__syncthreads() - sync_
threads_ and - Synchronize threads with AND of predicate.
Transpiles to:
__syncthreads_and(predicate) - sync_
threads_ count - Synchronize threads and count predicate.
Transpiles to:
__syncthreads_count(predicate) - sync_
threads_ or - Synchronize threads with OR of predicate.
Transpiles to:
__syncthreads_or(predicate) - tan
- Tangent. Transpiles to:
tanf(x) - tanh
- Hyperbolic tangent. Transpiles to:
tanhf(x) - thread_
fence - Thread memory fence.
Transpiles to:
__threadfence() - thread_
fence_ block - Block-level memory fence.
Transpiles to:
__threadfence_block() - thread_
fence_ system - System-wide memory fence.
Transpiles to:
__threadfence_system() - thread_
idx_ x - Get the thread index within a block (x dimension).
Transpiles to:
threadIdx.x - thread_
idx_ y - Get the thread index within a block (y dimension).
Transpiles to:
threadIdx.y - thread_
idx_ z - Get the thread index within a block (z dimension).
Transpiles to:
threadIdx.z - trailing_
zeros - Count trailing zeros (i32 version).
- trunc
- Truncate toward zero. Transpiles to:
truncf(x) - warp_
active_ mask - Get active thread mask. Transpiles to:
__activemask() - warp_
all - Warp all predicate. Transpiles to:
__all_sync(mask, predicate) - warp_
any - Warp any predicate. Transpiles to:
__any_sync(mask, predicate) - warp_
ballot - Warp ballot. Transpiles to:
__ballot_sync(mask, predicate) - warp_
match_ all - Warp match all. Transpiles to:
__match_all_sync(mask, val, pred) - warp_
match_ any - Warp match any. Transpiles to:
__match_any_sync(mask, val) - warp_
reduce_ add - Warp reduce add. Transpiles to:
__reduce_add_sync(mask, val) - warp_
reduce_ and - Warp reduce AND. Transpiles to:
__reduce_and_sync(mask, val) - warp_
reduce_ max - Warp reduce max. Transpiles to:
__reduce_max_sync(mask, val) - warp_
reduce_ min - Warp reduce min. Transpiles to:
__reduce_min_sync(mask, val) - warp_
reduce_ or - Warp reduce OR. Transpiles to:
__reduce_or_sync(mask, val) - warp_
reduce_ xor - Warp reduce XOR. Transpiles to:
__reduce_xor_sync(mask, val) - warp_
shfl - Warp shuffle. Transpiles to:
__shfl_sync(mask, val, lane) - warp_
shfl_ down - Warp shuffle down. Transpiles to:
__shfl_down_sync(mask, val, delta) - warp_
shfl_ up - Warp shuffle up. Transpiles to:
__shfl_up_sync(mask, val, delta) - warp_
shfl_ xor - Warp shuffle XOR. Transpiles to:
__shfl_xor_sync(mask, val, lane_mask) - warp_
size - Get the warp size (always 32 on NVIDIA GPUs).
Transpiles to:
warpSize