Module dsl

Module dsl 

Source
Expand description

Rust DSL functions for writing CUDA kernels.

This module provides Rust functions that map to CUDA intrinsics during transpilation. These functions have CPU fallback implementations for testing but are transpiled to the corresponding CUDA operations when used in kernel code.

§Thread/Block Index Access

use ringkernel_cuda_codegen::dsl::*;

fn my_kernel(...) {
    let tx = thread_idx_x();  // -> threadIdx.x
    let bx = block_idx_x();   // -> blockIdx.x
    let idx = bx * block_dim_x() + tx;  // Global thread index
}

§Thread Synchronization

sync_threads();  // -> __syncthreads()

§Math Functions

All standard math functions are available with CPU fallbacks:

  • Trigonometric: sin, cos, tan, asin, acos, atan, atan2
  • Hyperbolic: sinh, cosh, tanh, asinh, acosh, atanh
  • Exponential: exp, exp2, exp10, expm1, log, log2, log10, log1p
  • Power: pow, sqrt, rsqrt, cbrt
  • Rounding: floor, ceil, round, trunc
  • Comparison: fmin, fmax, fdim, copysign

§Warp Operations

let mask = warp_active_mask();       // Get active lane mask
let result = warp_reduce_add(mask, value);  // Warp-level sum
let shuffled = warp_shfl(mask, value, lane); // Shuffle

§Bit Manipulation

let bits = popc(x);      // Count set bits
let zeros = clz(x);      // Count leading zeros
let rev = brev(x);       // Reverse bits

Functions§

acos
Arccosine. Transpiles to: acosf(x)
acosh
Inverse hyperbolic cosine. Transpiles to: acoshf(x)
asin
Arcsine. Transpiles to: asinf(x)
asinh
Inverse hyperbolic sine. Transpiles to: asinhf(x)
atan
Arctangent. Transpiles to: atanf(x)
atan2
Two-argument arctangent. Transpiles to: atan2f(y, x)
atanh
Inverse hyperbolic tangent. Transpiles to: atanhf(x)
atomic_add
Atomic add. Transpiles to: atomicAdd(addr, val) WARNING: CPU fallback is NOT thread-safe!
atomic_add_f32
Atomic add for f32. Transpiles to: atomicAdd(addr, val)
atomic_and
Atomic AND. Transpiles to: atomicAnd(addr, val)
atomic_cas
Atomic compare and swap. Transpiles to: atomicCAS(addr, compare, val)
atomic_dec
Atomic decrement with wrap. Transpiles to: atomicDec(addr, val)
atomic_exchange
Atomic exchange. Transpiles to: atomicExch(addr, val)
atomic_inc
Atomic increment with wrap. Transpiles to: atomicInc(addr, val)
atomic_max
Atomic maximum. Transpiles to: atomicMax(addr, val)
atomic_min
Atomic minimum. Transpiles to: atomicMin(addr, val)
atomic_or
Atomic OR. Transpiles to: atomicOr(addr, val)
atomic_sub
Atomic subtract. Transpiles to: atomicSub(addr, val)
atomic_xor
Atomic XOR. Transpiles to: atomicXor(addr, val)
block_dim_x
Get the block dimension (x dimension). Transpiles to: blockDim.x
block_dim_y
Get the block dimension (y dimension). Transpiles to: blockDim.y
block_dim_z
Get the block dimension (z dimension). Transpiles to: blockDim.z
block_idx_x
Get the block index within a grid (x dimension). Transpiles to: blockIdx.x
block_idx_y
Get the block index within a grid (y dimension). Transpiles to: blockIdx.y
block_idx_z
Get the block index within a grid (z dimension). Transpiles to: blockIdx.z
brev
Bit reverse. Transpiles to: __brev(x)
byte_perm
Byte permutation. Transpiles to: __byte_perm(x, y, s)
cbrt
Cube root. Transpiles to: cbrtf(x)
ceil
Ceiling. Transpiles to: ceilf(x)
clamp_01
Clamp to [0, 1] (alias for saturate).
clock
Read clock counter. Transpiles to: clock()
clock64
Read 64-bit clock counter. Transpiles to: clock64()
clz
Count leading zeros. Transpiles to: __clz(x)
copysign
Copy sign. Transpiles to: copysignf(x, y)
cos
Cosine. Transpiles to: cosf(x)
cosh
Hyperbolic cosine. Transpiles to: coshf(x)
cospi
Cosine of pi*x. Transpiles to: cospif(x)
ctz
Count trailing zeros. Transpiles to: __ffs(x) - 1
erf
Error function. Transpiles to: erff(x)
erfc
Complementary error function. Transpiles to: erfcf(x)
exp
Exponential (base e). Transpiles to: expf(x)
exp2
Exponential (base 2). Transpiles to: exp2f(x)
exp10
Exponential (base 10). Transpiles to: exp10f(x)
expm1
exp(x) - 1 (accurate for small x). Transpiles to: expm1f(x)
fabs
Absolute value for f32. Transpiles to: fabsf(x)
fast_div
Fast division. Transpiles to: __fdividef(x, y)
fdim
Floating-point difference. Transpiles to: fdimf(x, y)
ffs
Find first set bit (1-indexed, 0 if none). Transpiles to: __ffs(x)
floor
Floor. Transpiles to: floorf(x)
fma
Fused multiply-add. Transpiles to: fmaf(a, b, c)
fmax
Maximum. Transpiles to: fmaxf(a, b)
fmin
Minimum. Transpiles to: fminf(a, b)
fmod
Floating-point modulo. Transpiles to: fmodf(x, y)
funnel_shift_left
Funnel shift left. Transpiles to: __funnelshift_l(lo, hi, shift)
funnel_shift_right
Funnel shift right. Transpiles to: __funnelshift_r(lo, hi, shift)
grid_dim_x
Get the grid dimension (x dimension). Transpiles to: gridDim.x
grid_dim_y
Get the grid dimension (y dimension). Transpiles to: gridDim.y
grid_dim_z
Get the grid dimension (z dimension). Transpiles to: gridDim.z
hypot
Hypotenuse. Transpiles to: hypotf(x, y)
ilogb
Extract exponent. Transpiles to: ilogbf(x)
is_finite
Check if finite. Transpiles to: isfinite(x)
is_infinite
Check if infinite. Transpiles to: isinf(x)
is_nan
Check if NaN. Transpiles to: isnan(x)
is_normal
Check if normal. Transpiles to: isnormal(x)
ldexp
Load exponent. Transpiles to: ldexpf(x, exp)
ldg
Read-only cache load. Transpiles to: __ldg(ptr)
leading_zeros
Count leading zeros (i32 version).
load_global
Load from global memory (alias for ldg).
log
Natural logarithm (base e). Transpiles to: logf(x)
log2
Logarithm (base 2). Transpiles to: log2f(x)
log1p
log(1 + x) (accurate for small x). Transpiles to: log1pf(x)
log10
Logarithm (base 10). Transpiles to: log10f(x)
nanosleep
Nanosleep. Transpiles to: __nanosleep(ns)
nextafter
Next representable value. Transpiles to: nextafterf(x, y)
popc
Population count (count set bits). Transpiles to: __popc(x)
popcount
Population count (i32 version).
pow
Power. Transpiles to: powf(x, y)
prefetch_l1
Prefetch to L1 cache. Transpiles to: __prefetch_l1(ptr)
prefetch_l2
Prefetch to L2 cache. Transpiles to: __prefetch_l2(ptr)
rcp
Fast reciprocal. Transpiles to: __frcp_rn(x)
remainder
Remainder. Transpiles to: remainderf(x, y)
reverse_bits
Bit reverse (i32 version).
round
Round to nearest. Transpiles to: roundf(x)
rsqrt
Reciprocal square root. Transpiles to: rsqrtf(x)
saturate
Saturate to [0, 1]. Transpiles to: __saturatef(x)
scalbn
Scale by power of 2. Transpiles to: scalbnf(x, n)
signbit
Check sign bit. Transpiles to: signbit(x)
sin
Sine. Transpiles to: sinf(x)
sincos
Sine and cosine together. Transpiles to: sincosf(x, &s, &c)
sinh
Hyperbolic sine. Transpiles to: sinhf(x)
sinpi
Sine of pi*x. Transpiles to: sinpif(x)
sqrt
Square root. Transpiles to: sqrtf(x)
sync_threads
Synchronize all threads in a block. Transpiles to: __syncthreads()
sync_threads_and
Synchronize threads with AND of predicate. Transpiles to: __syncthreads_and(predicate)
sync_threads_count
Synchronize threads and count predicate. Transpiles to: __syncthreads_count(predicate)
sync_threads_or
Synchronize threads with OR of predicate. Transpiles to: __syncthreads_or(predicate)
tan
Tangent. Transpiles to: tanf(x)
tanh
Hyperbolic tangent. Transpiles to: tanhf(x)
thread_fence
Thread memory fence. Transpiles to: __threadfence()
thread_fence_block
Block-level memory fence. Transpiles to: __threadfence_block()
thread_fence_system
System-wide memory fence. Transpiles to: __threadfence_system()
thread_idx_x
Get the thread index within a block (x dimension). Transpiles to: threadIdx.x
thread_idx_y
Get the thread index within a block (y dimension). Transpiles to: threadIdx.y
thread_idx_z
Get the thread index within a block (z dimension). Transpiles to: threadIdx.z
trailing_zeros
Count trailing zeros (i32 version).
trunc
Truncate toward zero. Transpiles to: truncf(x)
warp_active_mask
Get active thread mask. Transpiles to: __activemask()
warp_all
Warp all predicate. Transpiles to: __all_sync(mask, predicate)
warp_any
Warp any predicate. Transpiles to: __any_sync(mask, predicate)
warp_ballot
Warp ballot. Transpiles to: __ballot_sync(mask, predicate)
warp_match_all
Warp match all. Transpiles to: __match_all_sync(mask, val, pred)
warp_match_any
Warp match any. Transpiles to: __match_any_sync(mask, val)
warp_reduce_add
Warp reduce add. Transpiles to: __reduce_add_sync(mask, val)
warp_reduce_and
Warp reduce AND. Transpiles to: __reduce_and_sync(mask, val)
warp_reduce_max
Warp reduce max. Transpiles to: __reduce_max_sync(mask, val)
warp_reduce_min
Warp reduce min. Transpiles to: __reduce_min_sync(mask, val)
warp_reduce_or
Warp reduce OR. Transpiles to: __reduce_or_sync(mask, val)
warp_reduce_xor
Warp reduce XOR. Transpiles to: __reduce_xor_sync(mask, val)
warp_shfl
Warp shuffle. Transpiles to: __shfl_sync(mask, val, lane)
warp_shfl_down
Warp shuffle down. Transpiles to: __shfl_down_sync(mask, val, delta)
warp_shfl_up
Warp shuffle up. Transpiles to: __shfl_up_sync(mask, val, delta)
warp_shfl_xor
Warp shuffle XOR. Transpiles to: __shfl_xor_sync(mask, val, lane_mask)
warp_size
Get the warp size (always 32 on NVIDIA GPUs). Transpiles to: warpSize