Crate yscv_kernels

Execution kernels and backend abstraction for yscv.

§GPU Inference (Cross-Platform via wgpu)

The gpu feature enables compute-shader acceleration via wgpu, targeting Vulkan (Linux/Windows/Android), Metal (macOS/iOS), and DX12 (Windows). No CUDA dependency is required. GPU-accelerated operations:

  • Matrix multiplication (tiled 16×16 workgroups)
  • Elementwise: add, sub, mul
  • Activations: relu, sigmoid
  • Normalization: batch_norm, layer_norm, group_norm, rms_norm, softmax
  • Convolution: conv2d, depthwise_conv2d, separable_conv2d, transpose_conv2d
  • Pooling: max_pool2d, avg_pool2d

GPU training (backward passes) is on the roadmap. The CPU backend is SIMD-accelerated (NEON/AVX/SSE) on all supported platforms.
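The crate's `matmul_2d` is documented below as deterministic rank-2 matrix multiplication: (m x k) * (k x n) -> (m x n) with a fixed operation order. As a minimal sketch of that contract in plain Rust (not the crate's actual SIMD/parallel implementation, and the exact signature here is an assumption):

```rust
// Illustrative sketch: row-major rank-2 matmul with a fixed
// accumulation order over the shared dimension, the property the
// deterministic CPU backend guarantees.
fn matmul_2d(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let av = a[i * k + p];
            for j in 0..n {
                // The loop order fixes the floating-point summation
                // order, so results are bit-for-bit reproducible.
                out[i * n + j] += av * b[p * n + j];
            }
        }
    }
    out
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0]; // 2 x 2
    let b = [5.0, 6.0, 7.0, 8.0]; // 2 x 2
    let c = matmul_2d(&a, &b, 2, 2, 2);
    println!("{:?}", c); // [19.0, 22.0, 43.0, 50.0]
}
```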

Structs§

BatchNorm2dParams
Tensor parameter bundle for NHWC batch-normalization inference.
CpuBackend
Deterministic CPU backend with fixed operation order.
GroupNormNhwcParams
Tensor parameter bundle for NHWC group normalization.
LayerNormLastDimParams
Tensor parameter bundle for layer normalization over the last tensor dimension.
ParallelElementwiseConfig
Parallel heuristics for CPU elementwise operations.
ParallelMatmulConfig
Parallel heuristics for CPU matmul row-splitting.
RmsNormLastDimParams
Tensor parameter bundle for RMS normalization over the last tensor dimension.
SeparableConv2dParams
Tensor parameter bundle for NHWC separable convolution: depthwise ([KH, KW, C, depth_multiplier]) then pointwise ([1, 1, C*depth_multiplier, C_out]).
ThreadedCpuBackend
CPU backend with a dedicated rayon thread pool for predictable kernel threading depth.
ThreadedCpuBackendConfig
Runtime knobs for threaded CPU backend execution behavior.
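The `ParallelElementwiseConfig` and `ParallelMatmulConfig` structs above carry parallelization heuristics: thresholds below which parallel dispatch is not worth the thread overhead. A hypothetical sketch of that pattern with standard-library scoped threads; the field names and function signature are assumptions for illustration, not the crate's real API (which uses rayon):

```rust
use std::thread;

// Hypothetical stand-in for a "minimum work to go parallel" heuristic.
#[derive(Clone, Copy)]
struct ParallelElementwiseConfig {
    min_parallel_elements: usize,
    num_threads: usize,
}

fn add_with_config(a: &[f32], b: &[f32], cfg: ParallelElementwiseConfig) -> Vec<f32> {
    let n = a.len();
    assert_eq!(n, b.len());
    // Small inputs: thread spawn/sync overhead dominates, stay sequential.
    if n < cfg.min_parallel_elements || cfg.num_threads < 2 {
        return a.iter().zip(b).map(|(x, y)| x + y).collect();
    }
    // Large inputs: split into contiguous chunks, one per thread.
    let mut out = vec![0.0f32; n];
    let chunk = (n + cfg.num_threads - 1) / cfg.num_threads;
    thread::scope(|s| {
        for ((oc, ac), bc) in out
            .chunks_mut(chunk)
            .zip(a.chunks(chunk))
            .zip(b.chunks(chunk))
        {
            s.spawn(move || {
                for i in 0..oc.len() {
                    oc[i] = ac[i] + bc[i];
                }
            });
        }
    });
    out
}

fn main() {
    let cfg = ParallelElementwiseConfig { min_parallel_elements: 4, num_threads: 2 };
    let out = add_with_config(&[1.0, 2.0, 3.0, 4.0, 5.0], &[1.0; 5], cfg);
    println!("{:?}", out); // [2.0, 3.0, 4.0, 5.0, 6.0]
}
```

Chunked splitting keeps each output element owned by exactly one thread, so no synchronization is needed beyond the scope join.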

Enums§

BinaryKind
KernelError
Errors returned by kernel backends.

Constants§

CRATE_ID
DEFAULT_ELEMENTWISE_MIN_PARALLEL_ELEMENTS
DEFAULT_MATMUL_MIN_PARALLEL_OUTPUT_ELEMENTS
DEFAULT_MATMUL_MIN_PARALLEL_SHARED_DIM

Traits§

Backend
Runtime backend contract for core deterministic kernels.
BackwardOps
Extension trait for backward-pass operations.
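The `Backend` trait above defines a runtime contract that both `CpuBackend` and the GPU path implement. A hypothetical sketch of what such a contract looks like; the method names, signatures, and error type here are assumptions for illustration, not the crate's real interface:

```rust
// Hypothetical backend contract: callers program against the trait and
// swap backends at runtime without changing kernel call sites.
trait Backend {
    fn add(&self, a: &[f32], b: &[f32]) -> Result<Vec<f32>, String>;
    fn relu(&self, x: &[f32]) -> Vec<f32>;
}

struct CpuBackend;

impl Backend for CpuBackend {
    fn add(&self, a: &[f32], b: &[f32]) -> Result<Vec<f32>, String> {
        if a.len() != b.len() {
            return Err("shape mismatch".into());
        }
        // Fixed left-to-right order keeps results deterministic.
        Ok(a.iter().zip(b).map(|(x, y)| x + y).collect())
    }

    fn relu(&self, x: &[f32]) -> Vec<f32> {
        x.iter().map(|v| v.max(0.0)).collect()
    }
}

fn main() {
    let backend: &dyn Backend = &CpuBackend;
    println!("{:?}", backend.relu(&[-1.0, 2.5])); // [0.0, 2.5]
}
```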

Functions§

add
Backend-agnostic convenience call for add.
add_reduce_dispatch
Sum all values in data. Returns 0.0 for empty slices.
add_with_config
Backend-agnostic add with explicit elementwise parallelization heuristics.
add_with_config_and_pool
avg_pool2d_nhwc
NHWC average-pooling without padding.
avg_pool2d_nhwc_with_config
NHWC average-pooling without padding with explicit parallelization heuristics.
avg_pool2d_nhwc_with_config_and_pool
batch_norm2d_nhwc
NHWC per-channel batch normalization inference: out = ((x - mean) / sqrt(variance + epsilon)) * gamma + beta.
batch_norm2d_nhwc_with_config
NHWC per-channel batch normalization inference with explicit parallelization heuristics.
batch_norm2d_nhwc_with_config_and_pool
binary_same_shape_dispatch
conv2d_nhwc
NHWC convolution without padding using kernel shape [KH, KW, C_in, C_out].
conv2d_nhwc_with_config
NHWC convolution without padding with explicit parallelization heuristics.
conv2d_nhwc_with_config_and_pool
conv3d
3D convolution: input [B, D, H, W, C_in], kernel [KD, KH, KW, C_in, C_out], output [B, OD, OH, OW, C_out]. Supports padding and stride in all 3 dimensions.
deformable_conv2d_nhwc
NHWC deformable convolution with learned offsets.
depthwise_conv2d_nhwc
NHWC depthwise convolution without padding using kernel shape [KH, KW, C, depth_multiplier].
depthwise_conv2d_nhwc_with_config
NHWC depthwise convolution without padding with explicit parallelization heuristics.
depthwise_conv2d_nhwc_with_config_and_pool
dropout
Applies dropout: randomly zeroes elements with probability p.
embedding_lookup
Looks up embeddings from a weight matrix.
exp
Elementwise exp activation.
exp_slice_dispatch
Fast exp approximation applied element-wise: output[i] = exp(input[i]).
exp_with_config
Elementwise exp with explicit elementwise parallelization heuristics.
exp_with_config_and_pool
Safety
flash_attention
Memory-efficient (flash) attention — same result as scaled_dot_product_attention but uses O(Br×Bc) peak memory instead of O(seq_q×seq_k).
fma_slice_dispatch
Fused multiply-accumulate: acc[i] += a[i] * b[i].
gelu
Elementwise GELU activation (fast approximation): x * sigmoid(1.702 * x).
group_norm_nhwc
NHWC group normalization: normalize within groups of channels.
group_norm_nhwc_with_config
NHWC group normalization with explicit parallelization heuristics.
group_norm_nhwc_with_config_and_pool
layer_norm_last_dim
Layer normalization over the last tensor dimension.
layer_norm_last_dim_with_config
Layer normalization over the last tensor dimension with explicit elementwise parallelization heuristics.
layer_norm_last_dim_with_config_and_pool
Safety
log_softmax_last_dim
Log-softmax along the last tensor dimension.
log_softmax_last_dim_with_config
Log-softmax along the last tensor dimension with explicit elementwise parallelization heuristics.
log_softmax_last_dim_with_config_and_pool
Safety
logsumexp_last_dim
Log-sum-exp reduction along the last tensor dimension.
logsumexp_last_dim_with_config
Log-sum-exp reduction along the last tensor dimension with explicit elementwise parallelization heuristics.
logsumexp_last_dim_with_config_and_pool
matmul_2d
Deterministic rank-2 matrix multiplication: (m x k) * (k x n) -> (m x n).
matmul_2d_sequential
Single-thread deterministic rank-2 matrix multiplication.
matmul_2d_with_config
Rank-2 matrix multiplication with explicit parallelization heuristics.
matmul_2d_with_config_and_pool
matmul_2d_with_threads
Rank-2 matrix multiplication executed through a dedicated thread pool.
matmul_row_dispatch
Dispatch to the best available SIMD path for a single matmul output row.
max_pool2d_nhwc
NHWC max-pooling without padding.
max_pool2d_nhwc_with_config
NHWC max-pooling without padding with explicit parallelization heuristics.
max_pool2d_nhwc_with_config_and_pool
max_reduce_dispatch
Find the maximum value in data. Returns f32::NEG_INFINITY for empty slices.
mish
Elementwise Mish activation: x * tanh(ln(1 + exp(x))).
mul
Backend-agnostic convenience call for mul.
mul_with_config
Backend-agnostic multiply with explicit elementwise parallelization heuristics.
mul_with_config_and_pool
relu
Elementwise ReLU activation.
relu_inplace
In-place ReLU activation: clamps negative values to zero.
relu_out
ReLU writing into pre-allocated output tensor. Zero allocation overhead.
relu_slice_dispatch
relu_to_slice_dispatch
Two-argument ReLU: output[i] = max(0, input[i]).
relu_with_config
Elementwise ReLU with explicit elementwise parallelization heuristics.
relu_with_config_and_pool
Safety
rms_norm_last_dim
RMS normalization over the last tensor dimension.
rms_norm_last_dim_with_config
RMS normalization over the last tensor dimension with explicit parallelization heuristics.
rms_norm_last_dim_with_config_and_pool
scaled_dot_product_attention
Scaled dot-product attention for 2-D (unbatched) inputs.
separable_conv2d_nhwc
NHWC separable convolution without padding: depthwise ([KH, KW, C, depth_multiplier]) then pointwise ([1, 1, C*depth_multiplier, C_out]).
separable_conv2d_nhwc_with_config
NHWC separable convolution without padding with explicit parallelization heuristics.
separable_conv2d_nhwc_with_config_and_pool
sigmoid
Elementwise sigmoid activation.
sigmoid_slice_dispatch
Fast sigmoid applied element-wise: output[i] = 1 / (1 + exp(-input[i])).
sigmoid_with_config
Elementwise sigmoid with explicit elementwise parallelization heuristics.
sigmoid_with_config_and_pool
Safety
silu
Elementwise SiLU (Swish) activation: x * sigmoid(x).
softmax_last_dim
Softmax along the last tensor dimension.
softmax_last_dim_with_config
Softmax along the last tensor dimension with explicit elementwise parallelization heuristics.
softmax_last_dim_with_config_and_pool
Safety
sub
Backend-agnostic convenience call for sub.
sub_exp_slice_dispatch
Fused subtract-and-exp: output[i] = exp(input[i] - offset).
sub_with_config
Backend-agnostic subtract with explicit elementwise parallelization heuristics.
sub_with_config_and_pool
tanh_act
Elementwise tanh activation.
tanh_act_with_config
Elementwise tanh with explicit elementwise parallelization heuristics.
tanh_act_with_config_and_pool
Safety
tanh_slice_dispatch
Fast tanh applied element-wise: output[i] = tanh(input[i]).
transpose_conv2d_nhwc
CPU transposed convolution (deconvolution) in NHWC layout.
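Several of the functions above compose into numerically stable reductions: `softmax_last_dim` can be built from a max reduction followed by the fused subtract-and-exp of `sub_exp_slice_dispatch` (output[i] = exp(input[i] - offset)). A minimal sketch of that composition in plain Rust; the signature is an assumption and the real crate dispatches to SIMD paths:

```rust
// Illustrative softmax over the last dimension using the
// max-subtraction trick: offsetting by the row max keeps every
// exponent <= 0, so exp() cannot overflow.
fn softmax_last_dim(data: &[f32], last_dim: usize) -> Vec<f32> {
    assert!(last_dim > 0 && data.len() % last_dim == 0);
    let mut out = vec![0.0f32; data.len()];
    for (row_in, row_out) in data.chunks(last_dim).zip(out.chunks_mut(last_dim)) {
        // Max reduction (cf. max_reduce_dispatch).
        let m = row_in.iter().fold(f32::NEG_INFINITY, |a, &b| a.max(b));
        // Fused subtract-and-exp (cf. sub_exp_slice_dispatch), summed.
        let mut sum = 0.0f32;
        for (o, &x) in row_out.iter_mut().zip(row_in) {
            *o = (x - m).exp();
            sum += *o;
        }
        // Normalize each row to sum to 1.
        for o in row_out.iter_mut() {
            *o /= sum;
        }
    }
    out
}

fn main() {
    let out = softmax_last_dim(&[0.0, 0.0, 1.0, 3.0], 2);
    println!("{:?}", out); // first row uniform: [0.5, 0.5, ...]
}
```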