Crate yscv_kernels

Execution kernels and backend abstraction for yscv.

§GPU Inference (Cross-Platform via wgpu)

The gpu feature enables compute-shader acceleration via wgpu, targeting Vulkan (Linux/Windows/Android), Metal (macOS/iOS), and DX12 (Windows). No CUDA dependency is required. GPU-accelerated operations:

  • Matrix multiplication (tiled 16×16 workgroups)
  • Elementwise: add, sub, mul
  • Activations: relu, sigmoid
  • Normalization: batch_norm, layer_norm, group_norm, rms_norm, softmax
  • Convolution: conv2d, depthwise_conv2d, separable_conv2d, transpose_conv2d
  • Pooling: max_pool2d, avg_pool2d

GPU training (backward passes) is on the roadmap. The CPU backend is SIMD-accelerated (NEON/AVX/SSE) on all supported platforms.
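The crate's `matmul_2d` is documented below as deterministic rank-2 matrix multiplication: (m x k) * (k x n) -> (m x n) with a fixed operation order. As a minimal sketch of that contract in plain Rust (not the crate's actual SIMD/parallel implementation, and the exact signature here is an assumption):

```rust
// Illustrative sketch: row-major rank-2 matmul with a fixed
// accumulation order over the shared dimension, the property the
// deterministic CPU backend guarantees.
fn matmul_2d(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let av = a[i * k + p];
            for j in 0..n {
                // The loop order fixes the floating-point summation
                // order, so results are bit-for-bit reproducible.
                out[i * n + j] += av * b[p * n + j];
            }
        }
    }
    out
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0]; // 2 x 2
    let b = [5.0, 6.0, 7.0, 8.0]; // 2 x 2
    let c = matmul_2d(&a, &b, 2, 2, 2);
    println!("{:?}", c); // [19.0, 22.0, 43.0, 50.0]
}
```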

Structs§

BatchNorm2dParams
Tensor parameter bundle for NHWC batch-normalization inference.
CpuBackend
Deterministic CPU backend with fixed operation order.
GroupNormNhwcParams
Tensor parameter bundle for NHWC group normalization.
LayerNormLastDimParams
Tensor parameter bundle for layer normalization over the last tensor dimension.
ParallelElementwiseConfig
Parallel heuristics for CPU elementwise operations.
ParallelMatmulConfig
Parallel heuristics for CPU matmul row-splitting.
RmsNormLastDimParams
Tensor parameter bundle for RMS normalization over the last tensor dimension.
SeparableConv2dParams
Tensor parameter bundle for NHWC separable convolution: depthwise ([KH, KW, C, depth_multiplier]) then pointwise ([1, 1, C*depth_multiplier, C_out]).
ThreadedCpuBackend
CPU backend with a dedicated rayon thread pool for predictable kernel threading depth.
ThreadedCpuBackendConfig
Runtime knobs for threaded CPU backend execution behavior.
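The `ParallelElementwiseConfig` and `ParallelMatmulConfig` structs above carry parallelization heuristics: thresholds below which parallel dispatch is not worth the thread overhead. A hypothetical sketch of that pattern with standard-library scoped threads; the field names and function signature are assumptions for illustration, not the crate's real API (which uses rayon):

```rust
use std::thread;

// Hypothetical stand-in for a "minimum work to go parallel" heuristic.
#[derive(Clone, Copy)]
struct ParallelElementwiseConfig {
    min_parallel_elements: usize,
    num_threads: usize,
}

fn add_with_config(a: &[f32], b: &[f32], cfg: ParallelElementwiseConfig) -> Vec<f32> {
    let n = a.len();
    assert_eq!(n, b.len());
    // Small inputs: thread spawn/sync overhead dominates, stay sequential.
    if n < cfg.min_parallel_elements || cfg.num_threads < 2 {
        return a.iter().zip(b).map(|(x, y)| x + y).collect();
    }
    // Large inputs: split into contiguous chunks, one per thread.
    let mut out = vec![0.0f32; n];
    let chunk = (n + cfg.num_threads - 1) / cfg.num_threads;
    thread::scope(|s| {
        for ((oc, ac), bc) in out
            .chunks_mut(chunk)
            .zip(a.chunks(chunk))
            .zip(b.chunks(chunk))
        {
            s.spawn(move || {
                for i in 0..oc.len() {
                    oc[i] = ac[i] + bc[i];
                }
            });
        }
    });
    out
}

fn main() {
    let cfg = ParallelElementwiseConfig { min_parallel_elements: 4, num_threads: 2 };
    let out = add_with_config(&[1.0, 2.0, 3.0, 4.0, 5.0], &[1.0; 5], cfg);
    println!("{:?}", out); // [2.0, 3.0, 4.0, 5.0, 6.0]
}
```

Chunked splitting keeps each output element owned by exactly one thread, so no synchronization is needed beyond the scope join.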

Enums§

BinaryKind
KernelError
Errors returned by kernel backends.

Constants§

CRATE_ID
DEFAULT_ELEMENTWISE_MIN_PARALLEL_ELEMENTS
DEFAULT_MATMUL_MIN_PARALLEL_OUTPUT_ELEMENTS
DEFAULT_MATMUL_MIN_PARALLEL_SHARED_DIM

Traits§

Backend
Runtime backend contract for core deterministic kernels.
BackwardOps
Extension trait for backward-pass operations.
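The `Backend` trait above defines a runtime contract that both `CpuBackend` and the GPU path implement. A hypothetical sketch of what such a contract looks like; the method names, signatures, and error type here are assumptions for illustration, not the crate's real interface:

```rust
// Hypothetical backend contract: callers program against the trait and
// swap backends at runtime without changing kernel call sites.
trait Backend {
    fn add(&self, a: &[f32], b: &[f32]) -> Result<Vec<f32>, String>;
    fn relu(&self, x: &[f32]) -> Vec<f32>;
}

struct CpuBackend;

impl Backend for CpuBackend {
    fn add(&self, a: &[f32], b: &[f32]) -> Result<Vec<f32>, String> {
        if a.len() != b.len() {
            return Err("shape mismatch".into());
        }
        // Fixed left-to-right order keeps results deterministic.
        Ok(a.iter().zip(b).map(|(x, y)| x + y).collect())
    }

    fn relu(&self, x: &[f32]) -> Vec<f32> {
        x.iter().map(|v| v.max(0.0)).collect()
    }
}

fn main() {
    let backend: &dyn Backend = &CpuBackend;
    println!("{:?}", backend.relu(&[-1.0, 2.5])); // [0.0, 2.5]
}
```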

Functions§

add
Backend-agnostic convenience call for add.
add_reduce_dispatch
Sum all values in data. Returns 0.0 for empty slices.
add_with_config
Backend-agnostic add with explicit elementwise parallelization heuristics.
add_with_config_and_pool
avg_pool2d_nhwc
NHWC average-pooling without padding.
avg_pool2d_nhwc_with_config
NHWC average-pooling without padding with explicit parallelization heuristics.
avg_pool2d_nhwc_with_config_and_pool
batch_norm2d_nhwc
NHWC per-channel batch normalization inference: out = ((x - mean) / sqrt(variance + epsilon)) * gamma + beta.
batch_norm2d_nhwc_with_config
NHWC per-channel batch normalization inference with explicit parallelization heuristics.
batch_norm2d_nhwc_with_config_and_pool
binary_same_shape_dispatch
conv2d_nhwc
NHWC convolution without padding using kernel shape [KH, KW, C_in, C_out].
conv2d_nhwc_with_config
NHWC convolution without padding with explicit parallelization heuristics.
conv2d_nhwc_with_config_and_pool
conv3d
3D convolution: input [B, D, H, W, C_in], kernel [KD, KH, KW, C_in, C_out], output [B, OD, OH, OW, C_out]. Supports padding and stride in all 3 dimensions.
deformable_conv2d_nhwc
NHWC deformable convolution with learned offsets.
depthwise_conv2d_nhwc
NHWC depthwise convolution without padding using kernel shape [KH, KW, C, depth_multiplier].
depthwise_conv2d_nhwc_with_config
NHWC depthwise convolution without padding with explicit parallelization heuristics.
depthwise_conv2d_nhwc_with_config_and_pool
dropout
Applies dropout: randomly zeroes elements with probability p.
embedding_lookup
Looks up embeddings from a weight matrix.
exp
Elementwise exp activation.
exp_slice_dispatch
Fast exp approximation applied element-wise: output[i] = exp(input[i]).
exp_with_config
Elementwise exp with explicit elementwise parallelization heuristics.
exp_with_config_and_pool
Safety
flash_attention
Memory-efficient (flash) attention — same result as scaled_dot_product_attention but uses O(Br×Bc) peak memory instead of O(seq_q×seq_k).
fma_slice_dispatch
Fused multiply-accumulate: acc[i] += a[i] * b[i].
gelu
Elementwise GELU activation (fast approximation): x * sigmoid(1.702 * x).
group_norm_nhwc
NHWC group normalization: normalize within groups of channels.
group_norm_nhwc_with_config
NHWC group normalization with explicit parallelization heuristics.
group_norm_nhwc_with_config_and_pool
layer_norm_last_dim
Layer normalization over the last tensor dimension.
layer_norm_last_dim_with_config
Layer normalization over the last tensor dimension with explicit elementwise parallelization heuristics.
layer_norm_last_dim_with_config_and_pool
Safety
log_softmax_last_dim
Log-softmax along the last tensor dimension.
log_softmax_last_dim_with_config
Log-softmax along the last tensor dimension with explicit elementwise parallelization heuristics.
log_softmax_last_dim_with_config_and_pool
Safety
logsumexp_last_dim
Log-sum-exp reduction along the last tensor dimension.
logsumexp_last_dim_with_config
Log-sum-exp reduction along the last tensor dimension with explicit elementwise parallelization heuristics.
logsumexp_last_dim_with_config_and_pool
matmul_2d
Deterministic rank-2 matrix multiplication: (m x k) * (k x n) -> (m x n).
matmul_2d_sequential
Single-thread deterministic rank-2 matrix multiplication.
matmul_2d_with_config
Rank-2 matrix multiplication with explicit parallelization heuristics.
matmul_2d_with_config_and_pool
matmul_2d_with_threads
Rank-2 matrix multiplication executed through a dedicated thread pool.
matmul_row_dispatch
Dispatch to the best available SIMD path for a single matmul output row.
max_pool2d_nhwc
NHWC max-pooling without padding.
max_pool2d_nhwc_with_config
NHWC max-pooling without padding with explicit parallelization heuristics.
max_pool2d_nhwc_with_config_and_pool
max_reduce_dispatch
Find the maximum value in data. Returns f32::NEG_INFINITY for empty slices.
mish
Elementwise Mish activation: x * tanh(ln(1 + exp(x))).
mul
Backend-agnostic convenience call for mul.
mul_with_config
Backend-agnostic multiply with explicit elementwise parallelization heuristics.
mul_with_config_and_pool
relu
Elementwise ReLU activation.
relu_inplace
In-place ReLU activation: clamps negative values to zero.
relu_out
ReLU writing into pre-allocated output tensor. Zero allocation overhead.
relu_slice_dispatch
relu_to_slice_dispatch
Two-argument ReLU: output[i] = max(0, input[i]).
relu_with_config
Elementwise ReLU with explicit elementwise parallelization heuristics.
relu_with_config_and_pool
Safety
rms_norm_last_dim
RMS normalization over the last tensor dimension.
rms_norm_last_dim_with_config
RMS normalization over the last tensor dimension with explicit parallelization heuristics.
rms_norm_last_dim_with_config_and_pool
scaled_dot_product_attention
Scaled dot-product attention for 2-D (unbatched) inputs.
separable_conv2d_nhwc
NHWC separable convolution without padding: depthwise ([KH, KW, C, depth_multiplier]) then pointwise ([1, 1, C*depth_multiplier, C_out]).
separable_conv2d_nhwc_with_config
NHWC separable convolution without padding with explicit parallelization heuristics.
separable_conv2d_nhwc_with_config_and_pool
sigmoid
Elementwise sigmoid activation.
sigmoid_slice_dispatch
Fast sigmoid applied element-wise: output[i] = 1 / (1 + exp(-input[i])).
sigmoid_with_config
Elementwise sigmoid with explicit elementwise parallelization heuristics.
sigmoid_with_config_and_pool
Safety
silu
Elementwise SiLU (Swish) activation: x * sigmoid(x).
softmax_last_dim
Softmax along the last tensor dimension.
softmax_last_dim_with_config
Softmax along the last tensor dimension with explicit elementwise parallelization heuristics.
softmax_last_dim_with_config_and_pool
Safety
sub
Backend-agnostic convenience call for sub.
sub_exp_slice_dispatch
Fused subtract-and-exp: output[i] = exp(input[i] - offset).
sub_with_config
Backend-agnostic subtract with explicit elementwise parallelization heuristics.
sub_with_config_and_pool
tanh_act
Elementwise tanh activation.
tanh_act_with_config
Elementwise tanh with explicit elementwise parallelization heuristics.
tanh_act_with_config_and_pool
Safety
tanh_slice_dispatch
Fast tanh applied element-wise: output[i] = tanh(input[i]).
transpose_conv2d_nhwc
CPU transposed convolution (deconvolution) in NHWC layout.
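Several of the functions above compose into numerically stable reductions: `softmax_last_dim` can be built from a max reduction followed by the fused subtract-and-exp of `sub_exp_slice_dispatch` (output[i] = exp(input[i] - offset)). A minimal sketch of that composition in plain Rust; the signature is an assumption and the real crate dispatches to SIMD paths:

```rust
// Illustrative softmax over the last dimension using the
// max-subtraction trick: offsetting by the row max keeps every
// exponent <= 0, so exp() cannot overflow.
fn softmax_last_dim(data: &[f32], last_dim: usize) -> Vec<f32> {
    assert!(last_dim > 0 && data.len() % last_dim == 0);
    let mut out = vec![0.0f32; data.len()];
    for (row_in, row_out) in data.chunks(last_dim).zip(out.chunks_mut(last_dim)) {
        // Max reduction (cf. max_reduce_dispatch).
        let m = row_in.iter().fold(f32::NEG_INFINITY, |a, &b| a.max(b));
        // Fused subtract-and-exp (cf. sub_exp_slice_dispatch), summed.
        let mut sum = 0.0f32;
        for (o, &x) in row_out.iter_mut().zip(row_in) {
            *o = (x - m).exp();
            sum += *o;
        }
        // Normalize each row to sum to 1.
        for o in row_out.iter_mut() {
            *o /= sum;
        }
    }
    out
}

fn main() {
    let out = softmax_last_dim(&[0.0, 0.0, 1.0, 3.0], 2);
    println!("{:?}", out); // first row uniform: [0.5, 0.5, ...]
}
```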