Module dsl

Expand description

Rust DSL functions for writing CUDA kernels.

This module provides Rust functions that map to CUDA intrinsics during transpilation. These functions have CPU fallback implementations for testing but are transpiled to the corresponding CUDA operations when used in kernel code.

§Thread/Block Index Access

use ringkernel_cuda_codegen::dsl::*;

fn my_kernel(...) {
    let tx = thread_idx_x();  // -> threadIdx.x
    let bx = block_idx_x();   // -> blockIdx.x
    let idx = bx * block_dim_x() + tx;  // Global thread index
}

§Thread Synchronization

sync_threads();  // -> __syncthreads()

§Math Functions

All standard math functions are available with CPU fallbacks:

Trigonometric: sin, cos, tan, asin, acos, atan, atan2
Hyperbolic: sinh, cosh, tanh, asinh, acosh, atanh
Exponential: exp, exp2, exp10, expm1, log, log2, log10, log1p
Power: pow, sqrt, rsqrt, cbrt
Rounding: floor, ceil, round, trunc
Comparison: fmin, fmax, fdim, copysign

§Warp Operations

let mask = warp_active_mask();       // Get active lane mask
let result = warp_reduce_add(mask, value);  // Warp-level sum
let shuffled = warp_shfl(mask, value, lane); // Shuffle

§Bit Manipulation

let bits = popc(x);      // Count set bits
let zeros = clz(x);      // Count leading zeros
let rev = brev(x);       // Reverse bits

Functions§

acos: Arccosine. Transpiles to: acosf(x)
acosh: Inverse hyperbolic cosine. Transpiles to: acoshf(x)
asin: Arcsine. Transpiles to: asinf(x)
asinh: Inverse hyperbolic sine. Transpiles to: asinhf(x)
atan: Arctangent. Transpiles to: atanf(x)
atan2: Two-argument arctangent. Transpiles to: atan2f(y, x)
atanh: Inverse hyperbolic tangent. Transpiles to: atanhf(x)
atomic_add: Atomic add. Transpiles to: atomicAdd(addr, val) WARNING: CPU fallback is NOT thread-safe!
atomic_add_f32: Atomic add for f32. Transpiles to: atomicAdd(addr, val)
atomic_and: Atomic AND. Transpiles to: atomicAnd(addr, val)
atomic_cas: Atomic compare and swap. Transpiles to: atomicCAS(addr, compare, val)
atomic_dec: Atomic decrement with wrap. Transpiles to: atomicDec(addr, val)
atomic_exchange: Atomic exchange. Transpiles to: atomicExch(addr, val)
atomic_inc: Atomic increment with wrap. Transpiles to: atomicInc(addr, val)
atomic_max: Atomic maximum. Transpiles to: atomicMax(addr, val)
atomic_min: Atomic minimum. Transpiles to: atomicMin(addr, val)
atomic_or: Atomic OR. Transpiles to: atomicOr(addr, val)
atomic_sub: Atomic subtract. Transpiles to: atomicSub(addr, val)
atomic_xor: Atomic XOR. Transpiles to: atomicXor(addr, val)
block_dim_x: Get the block dimension (x dimension). Transpiles to: blockDim.x
block_dim_y: Get the block dimension (y dimension). Transpiles to: blockDim.y
block_dim_z: Get the block dimension (z dimension). Transpiles to: blockDim.z
block_idx_x: Get the block index within a grid (x dimension). Transpiles to: blockIdx.x
block_idx_y: Get the block index within a grid (y dimension). Transpiles to: blockIdx.y
block_idx_z: Get the block index within a grid (z dimension). Transpiles to: blockIdx.z
brev: Bit reverse. Transpiles to: __brev(x)
byte_perm: Byte permutation. Transpiles to: __byte_perm(x, y, s)
cbrt: Cube root. Transpiles to: cbrtf(x)
ceil: Ceiling. Transpiles to: ceilf(x)
clamp_01: Clamp to [0, 1] (alias for saturate).
clock: Read clock counter. Transpiles to: clock()
clock64: Read 64-bit clock counter. Transpiles to: clock64()
clz: Count leading zeros. Transpiles to: __clz(x)
copysign: Copy sign. Transpiles to: copysignf(x, y)
cos: Cosine. Transpiles to: cosf(x)
cosh: Hyperbolic cosine. Transpiles to: coshf(x)
cospi: Cosine of pi*x. Transpiles to: cospif(x)
ctz: Count trailing zeros. Transpiles to: __ffs(x) - 1
erf: Error function. Transpiles to: erff(x)
erfc: Complementary error function. Transpiles to: erfcf(x)
exp: Exponential (base e). Transpiles to: expf(x)
exp2: Exponential (base 2). Transpiles to: exp2f(x)
exp10: Exponential (base 10). Transpiles to: exp10f(x)
expm1: exp(x) - 1 (accurate for small x). Transpiles to: expm1f(x)
fabs: Absolute value for f32. Transpiles to: fabsf(x)
fast_div: Fast division. Transpiles to: __fdividef(x, y)
fdim: Floating-point difference. Transpiles to: fdimf(x, y)
ffs: Find first set bit (1-indexed, 0 if none). Transpiles to: __ffs(x)
floor: Floor. Transpiles to: floorf(x)
fma: Fused multiply-add. Transpiles to: fmaf(a, b, c)
fmax: Maximum. Transpiles to: fmaxf(a, b)
fmin: Minimum. Transpiles to: fminf(a, b)
fmod: Floating-point modulo. Transpiles to: fmodf(x, y)
funnel_shift_left: Funnel shift left. Transpiles to: __funnelshift_l(lo, hi, shift)
funnel_shift_right: Funnel shift right. Transpiles to: __funnelshift_r(lo, hi, shift)
grid_dim_x: Get the grid dimension (x dimension). Transpiles to: gridDim.x
grid_dim_y: Get the grid dimension (y dimension). Transpiles to: gridDim.y
grid_dim_z: Get the grid dimension (z dimension). Transpiles to: gridDim.z
hypot: Hypotenuse. Transpiles to: hypotf(x, y)
ilogb: Extract exponent. Transpiles to: ilogbf(x)
is_finite: Check if finite. Transpiles to: isfinite(x)
is_infinite: Check if infinite. Transpiles to: isinf(x)
is_nan: Check if NaN. Transpiles to: isnan(x)
is_normal: Check if normal. Transpiles to: isnormal(x)
ldexp: Load exponent. Transpiles to: ldexpf(x, exp)
ldg: Read-only cache load. Transpiles to: __ldg(ptr)
leading_zeros: Count leading zeros (i32 version).
load_global: Load from global memory (alias for ldg).
log: Natural logarithm (base e). Transpiles to: logf(x)
log2: Logarithm (base 2). Transpiles to: log2f(x)
log1p: log(1 + x) (accurate for small x). Transpiles to: log1pf(x)
log10: Logarithm (base 10). Transpiles to: log10f(x)
nanosleep: Nanosleep. Transpiles to: __nanosleep(ns)
nextafter: Next representable value. Transpiles to: nextafterf(x, y)
popc: Population count (count set bits). Transpiles to: __popc(x)
popcount: Population count (i32 version).
pow: Power. Transpiles to: powf(x, y)
prefetch_l1: Prefetch to L1 cache. Transpiles to: __prefetch_l1(ptr)
prefetch_l2: Prefetch to L2 cache. Transpiles to: __prefetch_l2(ptr)
rcp: Fast reciprocal. Transpiles to: __frcp_rn(x)
remainder: Remainder. Transpiles to: remainderf(x, y)
reverse_bits: Bit reverse (i32 version).
round: Round to nearest. Transpiles to: roundf(x)
rsqrt: Reciprocal square root. Transpiles to: rsqrtf(x)
saturate: Saturate to [0, 1]. Transpiles to: __saturatef(x)
scalbn: Scale by power of 2. Transpiles to: scalbnf(x, n)
signbit: Check sign bit. Transpiles to: signbit(x)
sin: Sine. Transpiles to: sinf(x)
sincos: Sine and cosine together. Transpiles to: sincosf(x, &s, &c)
sinh: Hyperbolic sine. Transpiles to: sinhf(x)
sinpi: Sine of pi*x. Transpiles to: sinpif(x)
sqrt: Square root. Transpiles to: sqrtf(x)
sync_threads: Synchronize all threads in a block. Transpiles to: __syncthreads()
sync_threads_and: Synchronize threads with AND of predicate. Transpiles to: __syncthreads_and(predicate)
sync_threads_count: Synchronize threads and count predicate. Transpiles to: __syncthreads_count(predicate)
sync_threads_or: Synchronize threads with OR of predicate. Transpiles to: __syncthreads_or(predicate)
tan: Tangent. Transpiles to: tanf(x)
tanh: Hyperbolic tangent. Transpiles to: tanhf(x)
thread_fence: Thread memory fence. Transpiles to: __threadfence()
thread_fence_block: Block-level memory fence. Transpiles to: __threadfence_block()
thread_fence_system: System-wide memory fence. Transpiles to: __threadfence_system()
thread_idx_x: Get the thread index within a block (x dimension). Transpiles to: threadIdx.x
thread_idx_y: Get the thread index within a block (y dimension). Transpiles to: threadIdx.y
thread_idx_z: Get the thread index within a block (z dimension). Transpiles to: threadIdx.z
trailing_zeros: Count trailing zeros (i32 version).
trunc: Truncate toward zero. Transpiles to: truncf(x)
warp_active_mask: Get active thread mask. Transpiles to: __activemask()
warp_all: Warp all predicate. Transpiles to: __all_sync(mask, predicate)
warp_any: Warp any predicate. Transpiles to: __any_sync(mask, predicate)
warp_ballot: Warp ballot. Transpiles to: __ballot_sync(mask, predicate)
warp_match_all: Warp match all. Transpiles to: __match_all_sync(mask, val, pred)
warp_match_any: Warp match any. Transpiles to: __match_any_sync(mask, val)
warp_reduce_add: Warp reduce add. Transpiles to: __reduce_add_sync(mask, val)
warp_reduce_and: Warp reduce AND. Transpiles to: __reduce_and_sync(mask, val)
warp_reduce_max: Warp reduce max. Transpiles to: __reduce_max_sync(mask, val)
warp_reduce_min: Warp reduce min. Transpiles to: __reduce_min_sync(mask, val)
warp_reduce_or: Warp reduce OR. Transpiles to: __reduce_or_sync(mask, val)
warp_reduce_xor: Warp reduce XOR. Transpiles to: __reduce_xor_sync(mask, val)
warp_shfl: Warp shuffle. Transpiles to: __shfl_sync(mask, val, lane)
warp_shfl_down: Warp shuffle down. Transpiles to: __shfl_down_sync(mask, val, delta)
warp_shfl_up: Warp shuffle up. Transpiles to: __shfl_up_sync(mask, val, delta)
warp_shfl_xor: Warp shuffle XOR. Transpiles to: __shfl_xor_sync(mask, val, lane_mask)
warp_size: Get the warp size (always 32 on NVIDIA GPUs). Transpiles to: warpSize

Module dsl

Module dsl Copy item path

§Thread/Block Index Access

§Thread Synchronization

§Math Functions

§Warp Operations

§Bit Manipulation

Functions§

Module dsl