Expand description
Unified Backend trait for CUDA, Metal, and CPU compute.
Each backend implements the same set of transformer-layer primitives
(GEMM, norms, RoPE, attention, activations). layer_forward() and
ModelRunner are generic over Backend, so one forward path serves
all hardware targets.
Modules§
- cpu
- CPU backend using Accelerate (macOS) / portable fallback (Linux). Context = () — all ops execute immediately, no batching needed.
Structs§
- Attn
Config - Configuration for attention dispatch.
- KvCache
- Per-layer KV cache. Each model owns its own
Vec<KvCache<B>>per sequence. - Quant
Weights - Packed quantized weight buffers passed to
Backend::gemm_quant.
Enums§
- Gguf
Quant Type - GGUF quantization sub-type (expand as kernels are added).
- Quant
Kind - Quantization flavour discriminator for
Backend::gemm_quant. - Reduce
Op - Collective-op reduction kind for TP all_reduce.
- SrcDtype
- Source dtype for a weight tensor read straight from safetensors mmap.
Traits§
- Backend
- The core abstraction over CUDA / Metal / CPU.