Execution kernels and backend abstraction for yscv.
GPU Inference (Cross-Platform via wgpu)
The `gpu` feature enables compute-shader acceleration via wgpu, targeting Vulkan (Linux/Windows/Android), Metal (macOS/iOS), and DX12 (Windows). No CUDA dependency. GPU-accelerated operations:
- Matrix multiplication (tiled 16×16 workgroups)
- Elementwise: add, sub, mul
- Activations: relu, sigmoid
- Normalization: batch_norm, layer_norm, group_norm, rms_norm, softmax
- Convolution: conv2d, depthwise_conv2d, separable_conv2d, transpose_conv2d
- Pooling: max_pool2d, avg_pool2d
GPU training (backward passes) is on the roadmap. The CPU backend is fully optimized with NEON/AVX/SSE SIMD on all platforms.
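The 16×16 tiling mentioned above can be illustrated on the CPU: each output tile is computed from 16-wide strips of the inputs, which is what a wgpu compute shader stages in workgroup memory. This is a standalone sketch of the blocking scheme, not the crate's actual kernel.

```rust
// Standalone illustration of 16x16-tiled matrix multiplication.
// Row-major layout: a is m x k, b is k x n, output is m x n.
const TILE: usize = 16;

fn matmul_tiled(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    // Walk the output in TILE x TILE blocks; on the GPU each block maps to
    // one workgroup, and the k-loop below is the per-tile accumulation pass.
    for i0 in (0..m).step_by(TILE) {
        for j0 in (0..n).step_by(TILE) {
            for k0 in (0..k).step_by(TILE) {
                for i in i0..(i0 + TILE).min(m) {
                    for j in j0..(j0 + TILE).min(n) {
                        let mut acc = out[i * n + j];
                        for kk in k0..(k0 + TILE).min(k) {
                            acc += a[i * k + kk] * b[kk * n + j];
                        }
                        out[i * n + j] = acc;
                    }
                }
            }
        }
    }
    out
}

fn main() {
    // Sanity check: [1, 2] * [[1, 2], [3, 4]] = [7, 10]
    let c = matmul_tiled(&[1.0, 2.0], &[1.0, 2.0, 3.0, 4.0], 1, 2, 2);
    assert_eq!(c, vec![7.0, 10.0]);
}
```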
Structs
- BatchNorm2dParams - Tensor parameter bundle for NHWC batch-normalization inference.
- CpuBackend - Deterministic CPU backend with fixed operation order.
- GroupNormNhwcParams - Tensor parameter bundle for NHWC group normalization.
- LayerNormLastDimParams - Tensor parameter bundle for layer normalization over the last tensor dimension.
- ParallelElementwiseConfig - Parallel heuristics for CPU elementwise operations.
- ParallelMatmulConfig - Parallel heuristics for CPU matmul row-splitting.
- RmsNormLastDimParams - Tensor parameter bundle for RMS normalization over the last tensor dimension.
- SeparableConv2dParams - Tensor parameter bundle for NHWC separable convolution: depthwise (`[KH, KW, C, depth_multiplier]`) then pointwise (`[1, 1, C*depth_multiplier, C_out]`).
- ThreadedCpuBackend - CPU backend with a dedicated rayon thread pool for predictable kernel threading depth.
- ThreadedCpuBackendConfig - Runtime knobs for threaded CPU backend execution behavior.
Enums
- BinaryKind
- KernelError - Errors returned by kernel backends.
Constants
- CRATE_ID
- DEFAULT_ELEMENTWISE_MIN_PARALLEL_ELEMENTS
- DEFAULT_MATMUL_MIN_PARALLEL_OUTPUT_ELEMENTS
- DEFAULT_MATMUL_MIN_PARALLEL_SHARED_DIM
Traits
- Backend - Runtime backend contract for core deterministic kernels.
- BackwardOps - Extension trait for backward-pass operations.
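To make the "runtime backend contract" idea concrete, here is a minimal sketch of what such a trait could look like. The trait shape, method names, and signatures below are assumptions for illustration only; the crate's actual `Backend` trait may differ.

```rust
// Hypothetical sketch of a backend contract for deterministic kernels.
// Not the crate's actual trait; names and signatures are illustrative.
trait Backend {
    /// Elementwise add over same-length slices, writing into `out`.
    fn add(&self, a: &[f32], b: &[f32], out: &mut [f32]);
}

struct CpuBackend;

impl Backend for CpuBackend {
    fn add(&self, a: &[f32], b: &[f32], out: &mut [f32]) {
        // A fixed iteration order keeps results bit-for-bit reproducible,
        // which is what "deterministic" means for a CPU backend.
        for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
            *o = x + y;
        }
    }
}

fn main() {
    let mut out = [0.0f32; 3];
    CpuBackend.add(&[1.0, 2.0, 3.0], &[10.0, 20.0, 30.0], &mut out);
    assert_eq!(out, [11.0, 22.0, 33.0]);
}
```

A trait like this is what lets the free functions below (`add`, `mul`, `sub`, ...) stay backend-agnostic: callers pass any implementor and the dispatch happens through the trait.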
Functions
- add - Backend-agnostic convenience call for add.
- add_reduce_dispatch - Sum all values in `data`. Returns `0.0` for empty slices.
- add_with_config - Backend-agnostic add with explicit elementwise parallelization heuristics.
- add_with_config_and_pool
- avg_pool2d_nhwc - NHWC average-pooling without padding.
- avg_pool2d_nhwc_with_config - NHWC average-pooling without padding, with explicit parallelization heuristics.
- avg_pool2d_nhwc_with_config_and_pool
- batch_norm2d_nhwc - NHWC per-channel batch normalization inference: `out = ((x - mean) / sqrt(variance + epsilon)) * gamma + beta`.
- batch_norm2d_nhwc_with_config - NHWC per-channel batch normalization inference with explicit parallelization heuristics.
- batch_norm2d_nhwc_with_config_and_pool
- binary_same_shape_dispatch
- conv2d_nhwc - NHWC convolution without padding using kernel shape `[KH, KW, C_in, C_out]`.
- conv2d_nhwc_with_config - NHWC convolution without padding, with explicit parallelization heuristics.
- conv2d_nhwc_with_config_and_pool
- conv3d - 3D convolution: input `[B, D, H, W, C_in]`, kernel `[KD, KH, KW, C_in, C_out]`, output `[B, OD, OH, OW, C_out]`. Supports padding and stride in all three dimensions.
- deformable_conv2d_nhwc - NHWC deformable convolution with learned offsets.
- depthwise_conv2d_nhwc - NHWC depthwise convolution without padding using kernel shape `[KH, KW, C, depth_multiplier]`.
- depthwise_conv2d_nhwc_with_config - NHWC depthwise convolution without padding, with explicit parallelization heuristics.
- depthwise_conv2d_nhwc_with_config_and_pool
- dropout - Applies dropout: randomly zeroes elements with probability `p`.
- embedding_lookup - Looks up embeddings from a weight matrix.
- exp - Elementwise exp activation.
- exp_slice_dispatch - Fast exp approximation applied element-wise: `output[i] = exp(input[i])`.
- exp_with_config - Elementwise exp with explicit elementwise parallelization heuristics.
- exp_with_config_and_pool ⚠ - Safety
- flash_attention - Memory-efficient (flash) attention; same result as `scaled_dot_product_attention` but uses O(Br×Bc) peak memory instead of O(seq_q×seq_k).
- fma_slice_dispatch - Fused multiply-accumulate: `acc[i] += a[i] * b[i]`.
- gelu - Elementwise GELU activation (fast approximation): `x * sigmoid(1.702 * x)`.
- group_norm_nhwc - NHWC group normalization: normalize within groups of channels.
- group_norm_nhwc_with_config - NHWC group normalization with explicit parallelization heuristics.
- group_norm_nhwc_with_config_and_pool
- layer_norm_last_dim - Layer normalization over the last tensor dimension.
- layer_norm_last_dim_with_config - Layer normalization over the last tensor dimension with explicit elementwise parallelization heuristics.
- layer_norm_last_dim_with_config_and_pool ⚠ - Safety
- log_softmax_last_dim - Log-softmax along the last tensor dimension.
- log_softmax_last_dim_with_config - Log-softmax along the last tensor dimension with explicit elementwise parallelization heuristics.
- log_softmax_last_dim_with_config_and_pool ⚠ - Safety
- logsumexp_last_dim - Log-sum-exp reduction along the last tensor dimension.
- logsumexp_last_dim_with_config - Log-sum-exp reduction along the last tensor dimension with explicit elementwise parallelization heuristics.
- logsumexp_last_dim_with_config_and_pool
- matmul_2d - Deterministic rank-2 matrix multiplication: `(m x k) * (k x n) -> (m x n)`.
- matmul_2d_sequential - Single-thread deterministic rank-2 matrix multiplication.
- matmul_2d_with_config - Rank-2 matrix multiplication with explicit parallelization heuristics.
- matmul_2d_with_config_and_pool
- matmul_2d_with_threads - Rank-2 matrix multiplication executed through a dedicated thread pool.
- matmul_row_dispatch ⚠ - Dispatch to the best available SIMD path for a single matmul output row.
- max_pool2d_nhwc - NHWC max-pooling without padding.
- max_pool2d_nhwc_with_config - NHWC max-pooling without padding, with explicit parallelization heuristics.
- max_pool2d_nhwc_with_config_and_pool
- max_reduce_dispatch - Find the maximum value in `data`. Returns `f32::NEG_INFINITY` for empty slices.
- mish - Elementwise Mish activation: `x * tanh(ln(1 + exp(x)))`.
- mul - Backend-agnostic convenience call for mul.
- mul_with_config - Backend-agnostic multiply with explicit elementwise parallelization heuristics.
- mul_with_config_and_pool
- relu - Elementwise ReLU activation.
- relu_inplace - In-place ReLU activation: clamps negative values to zero.
- relu_out - ReLU writing into a pre-allocated output tensor; zero allocation overhead.
- relu_slice_dispatch
- relu_to_slice_dispatch - Two-argument ReLU: `output[i] = max(0, input[i])`.
- relu_with_config - Elementwise ReLU with explicit elementwise parallelization heuristics.
- relu_with_config_and_pool ⚠ - Safety
- rms_norm_last_dim - RMS normalization over the last tensor dimension.
- rms_norm_last_dim_with_config - RMS normalization over the last tensor dimension with explicit parallelization heuristics.
- rms_norm_last_dim_with_config_and_pool
- scaled_dot_product_attention - Scaled dot-product attention for 2-D (unbatched) inputs.
- separable_conv2d_nhwc - NHWC separable convolution without padding: depthwise (`[KH, KW, C, depth_multiplier]`) then pointwise (`[1, 1, C*depth_multiplier, C_out]`).
- separable_conv2d_nhwc_with_config - NHWC separable convolution without padding, with explicit parallelization heuristics.
- separable_conv2d_nhwc_with_config_and_pool
- sigmoid - Elementwise sigmoid activation.
- sigmoid_slice_dispatch - Fast sigmoid applied element-wise: `output[i] = 1 / (1 + exp(-input[i]))`.
- sigmoid_with_config - Elementwise sigmoid with explicit elementwise parallelization heuristics.
- sigmoid_with_config_and_pool ⚠ - Safety
- silu - Elementwise SiLU (Swish) activation: `x * sigmoid(x)`.
- softmax_last_dim - Softmax along the last tensor dimension.
- softmax_last_dim_with_config - Softmax along the last tensor dimension with explicit elementwise parallelization heuristics.
- softmax_last_dim_with_config_and_pool ⚠ - Safety
- sub - Backend-agnostic convenience call for sub.
- sub_exp_slice_dispatch - Fused subtract-and-exp: `output[i] = exp(input[i] - offset)`.
- sub_with_config - Backend-agnostic subtract with explicit elementwise parallelization heuristics.
- sub_with_config_and_pool
- tanh_act - Elementwise tanh activation.
- tanh_act_with_config - Elementwise tanh with explicit elementwise parallelization heuristics.
- tanh_act_with_config_and_pool ⚠ - Safety
- tanh_slice_dispatch - Fast tanh applied element-wise: `output[i] = tanh(input[i])`.
- transpose_conv2d_nhwc - CPU transposed convolution (deconvolution) in NHWC layout.
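Several entries above state their exact semantics inline (the batch-norm formula, and the subtract-then-exp pattern behind `sub_exp_slice_dispatch` that makes softmax numerically stable). As a standalone reference sketch of those formulas, not the crate's kernels:

```rust
// Reference implementations of two formulas documented above.

// out = ((x - mean) / sqrt(variance + epsilon)) * gamma + beta
fn batch_norm(x: f32, mean: f32, variance: f32, epsilon: f32, gamma: f32, beta: f32) -> f32 {
    ((x - mean) / (variance + epsilon).sqrt()) * gamma + beta
}

// Numerically stable softmax over one row: subtract the row max before exp,
// i.e. the fused `output[i] = exp(input[i] - offset)` pattern with
// `offset` = row maximum, then normalize by the sum.
fn softmax_row(row: &[f32]) -> Vec<f32> {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // With mean = 0, variance = 1, gamma = 1, beta = 0, batch norm is
    // (almost) the identity; epsilon perturbs it only slightly.
    let y = batch_norm(2.0, 0.0, 1.0, 1e-5, 1.0, 0.0);
    assert!((y - 2.0).abs() < 1e-3);

    // Softmax outputs are positive, sum to 1, and preserve ordering.
    let p = softmax_row(&[1.0, 2.0, 3.0]);
    assert!((p.iter().sum::<f32>() - 1.0).abs() < 1e-6);
    assert!(p[2] > p[1] && p[1] > p[0]);
}
```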