Kernel backend abstraction layer for LLM-specific fused operations.
This module defines a mid-level abstraction between raw KernelExecutor
(too low-level: grid/block sizes) and TensorOps (too high-level: no
LLM-specific fused ops). It enables pluggable CUDA/Metal/CPU backends
through six focused sub-traits composed into one umbrella KernelOps.
Structs§
- AttentionParams - Parameters describing a single attention call.
- KernelOpsDispatch - Dispatch wrapper that tries KernelOps first, then falls back to TensorOps for operations that have a TensorOps equivalent.
- RoPEConfig - Rotary position embedding configuration.
- SamplingParams - Sampling parameters for GPU-side token sampling.
Enums§
- QuantScheme - Quantization scheme descriptor for quantized linear ops.
Traits§
- ActivationOps - Activation function operations (including fused variants).
- AttentionOps - Attention operations.
- KernelOps - Unified kernel operations interface.
- LinearOps - Linear / matrix-multiply operations.
- NormOps - Normalization operations.
- PositionOps - Positional encoding operations.
- SamplingOps - Token sampling operations (GPU-side when possible).