Unified Backend trait for CUDA, Metal, and CPU compute.
Each backend implements the same set of transformer-layer primitives
(GEMM, norms, RoPE, attention, activations). layer_forward() and
ModelRunner are generic over Backend, so one forward path serves
all hardware targets.
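A minimal sketch of what "generic over Backend" can look like. The trait below is a hypothetical, trimmed-down stand-in and not the crate's real Backend: the method names (upload, download, gemm, relu), their signatures, and toy_layer_forward are all assumptions made for illustration.

```rust
// Hypothetical stand-in for the crate's Backend trait (names and signatures are assumptions).
trait Backend {
    type Buffer;
    fn upload(&self, data: &[f32]) -> Self::Buffer;
    fn download(&self, buf: &Self::Buffer) -> Vec<f32>;
    /// Row-major GEMM: out[m x n] = a[m x k] * b[k x n].
    fn gemm(&self, a: &Self::Buffer, b: &Self::Buffer, m: usize, k: usize, n: usize) -> Self::Buffer;
    fn relu(&self, x: &Self::Buffer) -> Self::Buffer;
}

// Portable CPU implementation: buffers are plain Vec<f32>, every op runs immediately.
struct CpuBackend;

impl Backend for CpuBackend {
    type Buffer = Vec<f32>;

    fn upload(&self, data: &[f32]) -> Self::Buffer {
        data.to_vec()
    }

    fn download(&self, buf: &Self::Buffer) -> Vec<f32> {
        buf.clone()
    }

    fn gemm(&self, a: &Self::Buffer, b: &Self::Buffer, m: usize, k: usize, n: usize) -> Self::Buffer {
        let mut out = vec![0.0f32; m * n];
        for i in 0..m {
            for j in 0..n {
                for p in 0..k {
                    out[i * n + j] += a[i * k + p] * b[p * n + j];
                }
            }
        }
        out
    }

    fn relu(&self, x: &Self::Buffer) -> Self::Buffer {
        x.iter().map(|&v| v.max(0.0)).collect()
    }
}

// One forward path, written once against the trait; any backend that implements it can run this.
fn toy_layer_forward<B: Backend>(
    be: &B,
    x: &B::Buffer,
    w: &B::Buffer,
    m: usize,
    k: usize,
    n: usize,
) -> B::Buffer {
    let y = be.gemm(x, w, m, k, n);
    be.relu(&y)
}

fn main() {
    let be = CpuBackend;
    let x = be.upload(&[1.0, -2.0]); // 1 x 2 input
    let w = be.upload(&[1.0, 0.0, 0.0, 1.0]); // 2 x 2 identity weight
    let out = toy_layer_forward(&be, &x, &w, 1, 2, 2);
    println!("{:?}", be.download(&out));
}
```

A CUDA or Metal backend would supply its own Buffer type and kernels while toy_layer_forward stays unchanged.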
Re-exports§
Modules§
- buffer
- Typed buffer wrappers — Phase B foundation.
- cpu
- CPU backend using Accelerate (macOS) / portable fallback (Linux). Context = () — all ops execute immediately, no batching needed.
- dtype
- Runtime element type tag for typed device buffers.
- timer
- Cross-backend GPU-side timer trait — PLAYBOOK § Phase 1.1.
Structs§
- AttnConfig
- Configuration for attention dispatch.
- KvBf16
- BF16 KV cache (drop-in replacement for FP16 on Ampere+ / Apple Silicon).
- KvCache
- Per-layer KV cache. Each model owns its own Vec<KvCache<B, K>> per sequence. The K: KvDtypeKind parameter selects the cache element type; it defaults to KvFp16 so existing call sites that wrote KvCache<B> keep compiling unchanged.
- KvCacheQuant
- Quantized-KV cache (Dim 5 INT8 / future FP8 paths). Sibling of KvCache for backends that store K/V in a non-FP16 element type plus per-token per-kv-head scales.
- KvFp8
- FP8 KV cache — E4M3 by default. Hopper+ on CUDA, future on Metal.
- KvFp16
- FP16 KV cache (the existing default on CUDA + Metal).
- KvInt8
- INT8 KV cache — half the memory of FP16 with per-token / per-channel scale factors. CUDA path planned via vLLM’s quant_kv kernels.
- MoeRouting
- Routing buffers consumed by moe_gemm_phase_vllm, held by the caller across phase 1 and phase 3 of one MoE forward. All three fields are i32 device tensors in disguise (Self::Buffer = fp16 on CUDA; the backend reinterprets the underlying device pointer).
- QuantWeights
- Packed quantized weight buffers passed to Backend::gemm_quant.
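The KvCache entry above relies on a defaulted type parameter so that KvCache<B> keeps meaning the FP16 cache. The sketch below shows that mechanism in isolation. The trait constants (BYTES_PER_ELEM, NAME), the struct fields, and the constructor are hypothetical, and the INT8 variant ignores the scale buffers a real quantized cache would carry.

```rust
use std::marker::PhantomData;

// Hypothetical marker trait: one associated constant per piece of dtype metadata.
trait KvDtypeKind {
    const BYTES_PER_ELEM: usize;
    const NAME: &'static str;
}

struct KvFp16;
struct KvInt8;

impl KvDtypeKind for KvFp16 {
    const BYTES_PER_ELEM: usize = 2;
    const NAME: &'static str = "fp16";
}

impl KvDtypeKind for KvInt8 {
    const BYTES_PER_ELEM: usize = 1;
    const NAME: &'static str = "int8";
}

// Placeholder backend type; only used as a type argument here.
struct CpuBackend;

// `K = KvFp16` is the default, so `KvCache<B>` is shorthand for `KvCache<B, KvFp16>`.
struct KvCache<B, K: KvDtypeKind = KvFp16> {
    bytes: Vec<u8>,
    _marker: PhantomData<(B, K)>,
}

impl<B, K: KvDtypeKind> KvCache<B, K> {
    fn new(max_tokens: usize, kv_dim: usize) -> Self {
        // Factor 2: one region for K, one for V. Scales for quantized dtypes are omitted.
        let len = 2 * max_tokens * kv_dim * K::BYTES_PER_ELEM;
        Self { bytes: vec![0; len], _marker: PhantomData }
    }

    fn size_bytes(&self) -> usize {
        self.bytes.len()
    }

    fn dtype_name(&self) -> &'static str {
        K::NAME
    }
}

fn main() {
    // Old-style call site: no K written, resolves to KvFp16.
    let legacy: KvCache<CpuBackend> = KvCache::new(4, 8);
    // New-style call site: element type chosen explicitly.
    let quant: KvCache<CpuBackend, KvInt8> = KvCache::new(4, 8);
    println!("{} -> {} bytes", legacy.dtype_name(), legacy.size_bytes());
    println!("{} -> {} bytes", quant.dtype_name(), quant.size_bytes());
}
```

The default only applies where the type is written out (annotations, fields, signatures), which is the case the description refers to.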
Enums§
- GgufQuantType
- GGUF quantization sub-type (expand as kernels are added).
- QuantKind
- Quantization flavour discriminator for Backend::gemm_quant.
- ReduceOp
- Collective-op reduction kind for TP all_reduce.
- SrcDtype
- Source dtype for a weight tensor read straight from safetensors mmap.
Constants§
- MAX_LAYERS_FOR_GRAPH
- Maximum decode-graph layer count. Per-layer call sites that share graph-captured host staging arrays use this as the stride between distinct slots. CUDA-only invariant (other backends ignore the slot argument). The value 64 covers the LLM families currently run; Llama-3-70B has 80 layers and exceeds it, but 70B doesn't run on a single 4090 anyway, so 64 is safe in practice for v0.2.
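One plausible reading of "stride between distinct slots" is that a flat staging array is indexed as slot * MAX_LAYERS_FOR_GRAPH + layer. The helper below sketches that arithmetic. staging_index is a hypothetical function, and the layout is an assumption; only the constant's name and its value of 64 come from the description above.

```rust
// Value taken from the description above; the indexing scheme is an assumption.
const MAX_LAYERS_FOR_GRAPH: usize = 64;

/// Hypothetical helper: flat index of `layer` within `slot` of a shared staging array.
/// Returns None when the layer would spill into the next slot's range.
fn staging_index(slot: usize, layer: usize) -> Option<usize> {
    if layer >= MAX_LAYERS_FOR_GRAPH {
        return None;
    }
    Some(slot * MAX_LAYERS_FOR_GRAPH + layer)
}

fn main() {
    println!("{:?}", staging_index(0, 5));
    println!("{:?}", staging_index(1, 0));
    // An 80-layer model has layer indices up to 79, which do not fit a 64-wide slot.
    println!("{:?}", staging_index(0, 79));
}
```

Under this layout, a model with more than 64 layers would need either a larger constant or a different staging scheme, which matches the note about 80-layer models.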
Traits§
- Backend
- The core abstraction over CUDA / Metal / CPU.
- BackendCollective
- Capability-trait for backends that support multi-rank collective ops. Single-GPU backends inherit the no-op defaults: world_size = 1, rank = 0, and the collective ops are identity. Multi-rank backends (CUDA + NCCL today, AMD + RCCL in the future) override these.
- BackendGraph
- Capability-trait for backends that can capture and replay execution as a graph (CUDA Graph). Models that call these methods bound their generic on B: BackendGraph; backends without graph support (Metal, CPU) impl this trait with an empty body and inherit no-op / unsupported defaults.
- BackendInt8KvOps
- INT8 KV cache operations (Dim 5).
- BackendKvDtype
- Capability-trait for backends that can store + read a KV cache of type K.
- BackendMoeFused
- Capability-trait for backends that natively dispatch MoE post-ops + routing.
- BackendPagedKv
- Capability-trait for backends that support paged KV cache + paged attention.
- BackendQuantGguf
- Capability-trait for backends that natively dispatch GGUF k-quant GEMM / GEMV. Metal wires its q4k/q6k shaders here; CUDA/CPU inherit defaults that error.
- BackendQuantMarlin
- Capability-trait for backends that natively support Marlin INT4 GEMM. CUDA wires this to the Marlin (or vLLM marlin_moe_wna16) tile kernels; other backends inherit defaults that error or no-op.
- KvDtypeKind
- Marker trait + metadata for a KV cache element type.
- KvLayer
- Per-K-dtype dispatch trait.
- LlmBackend
- Minimum capability set for a decoder-only LLM: the core compute trait plus paged-KV cache + graph-capture support. Every concrete backend (CUDA / Metal / CPU) satisfies this.
- MoeLlmBackend
- MoE-capable LLM backend: adds the fused MoE routing + post-op kernels to the quant LLM bundle. Required by Qwen3-MoE / future MoE models.
- QuantLlmBackend
- LLM backend that also supports quantized weight loading (GPTQ Marlin for CUDA; GGUF k-quant for Metal). Required by models that hold Box<dyn Linear<B>> where the Linear impl might be a quant variant.
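The capability traits listed above follow a common Rust pattern: default method bodies give single-device or unsupported behaviour, and a bundle trait collects the capabilities a model needs. A minimal sketch under stated assumptions follows. Every method name here (name, all_reduce_sum, supports_graph, begin_capture), the LlmBackendSketch bundle, and its blanket impl are hypothetical; the sketch also bundles only graph support, whereas the real LlmBackend is described as including paged-KV as well.

```rust
// Hypothetical core trait, reduced to one method for the sketch.
trait Backend {
    fn name(&self) -> &'static str;
}

// Capability trait with single-device defaults: world_size = 1, rank = 0, identity collectives.
trait BackendCollective: Backend {
    fn world_size(&self) -> usize {
        1
    }
    fn rank(&self) -> usize {
        0
    }
    fn all_reduce_sum(&self, _buf: &mut [f32]) {
        // With one rank, the reduced value is the local value, so nothing changes.
    }
}

// Capability trait whose defaults report "unsupported".
trait BackendGraph: Backend {
    fn supports_graph(&self) -> bool {
        false
    }
    fn begin_capture(&self) -> Result<(), String> {
        Err(format!("{}: graph capture unsupported", self.name()))
    }
}

// Bundle trait (assumption: implemented via a blanket impl over its parts).
trait LlmBackendSketch: Backend + BackendGraph {}
impl<T: Backend + BackendGraph> LlmBackendSketch for T {}

struct CpuBackend;

impl Backend for CpuBackend {
    fn name(&self) -> &'static str {
        "cpu"
    }
}

// Empty impl bodies: the backend opts in to the trait and inherits every default.
impl BackendCollective for CpuBackend {}
impl BackendGraph for CpuBackend {}

// Model-side code states exactly which capabilities it needs in its bounds.
fn describe<B: LlmBackendSketch + BackendCollective>(be: &B) -> String {
    format!(
        "{} rank {}/{} graph={}",
        be.name(),
        be.rank(),
        be.world_size(),
        be.supports_graph()
    )
}

fn main() {
    let be = CpuBackend;
    let mut buf = [1.0f32, 2.0];
    be.all_reduce_sum(&mut buf);
    println!("{} {:?}", describe(&be), buf);
}
```

A multi-rank or graph-capable backend would override the defaults instead of leaving the impl body empty, and model code written against the bounds would not change.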