Module backend

Unified Backend trait for CUDA, Metal, and CPU compute.

Each backend implements the same set of transformer-layer primitives (GEMM, norms, RoPE, attention, activations). layer_forward() and ModelRunner are generic over Backend, so one forward path serves all hardware targets.
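As a rough illustration of what "generic over Backend" means, the sketch below writes one forward function against a trait and runs it on a toy CPU implementation. Everything here is hypothetical: `MiniBackend`, `ToyCpu`, `gemv`, `relu`, and `toy_layer_forward` are stand-ins invented for the example, since the real Backend trait's methods and signatures are not listed on this page.

```rust
/// Hypothetical, heavily reduced stand-in for the real `Backend` trait.
trait MiniBackend {
    /// Device buffer type; a plain Vec<f32> for the toy CPU backend below.
    type Buffer;

    fn upload(data: &[f32]) -> Self::Buffer;
    fn download(buf: &Self::Buffer) -> Vec<f32>;

    /// y = x * w, where x is 1 x k and w is k x n (row-major).
    fn gemv(x: &Self::Buffer, w: &Self::Buffer, k: usize, n: usize) -> Self::Buffer;

    /// Element-wise ReLU, standing in for the activation primitives.
    fn relu(x: &Self::Buffer) -> Self::Buffer;
}

/// Toy backend that computes everything on the host.
struct ToyCpu;

impl MiniBackend for ToyCpu {
    type Buffer = Vec<f32>;

    fn upload(data: &[f32]) -> Self::Buffer {
        data.to_vec()
    }

    fn download(buf: &Self::Buffer) -> Vec<f32> {
        buf.clone()
    }

    fn gemv(x: &Self::Buffer, w: &Self::Buffer, k: usize, n: usize) -> Self::Buffer {
        assert_eq!(x.len(), k);
        assert_eq!(w.len(), k * n);
        (0..n)
            .map(|j| (0..k).map(|i| x[i] * w[i * n + j]).sum::<f32>())
            .collect()
    }

    fn relu(x: &Self::Buffer) -> Self::Buffer {
        x.iter().map(|v| v.max(0.0)).collect()
    }
}

/// One forward path, written once and monomorphized per backend.
fn toy_layer_forward<B: MiniBackend>(
    x: &B::Buffer,
    w: &B::Buffer,
    k: usize,
    n: usize,
) -> B::Buffer {
    let h = B::gemv(x, w, k, n);
    B::relu(&h)
}
```

Calling `toy_layer_forward::<ToyCpu>(&x, &w, 2, 2)` selects the backend at compile time; a GPU backend would only need its own `impl MiniBackend`, and the forward function itself would stay unchanged.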

Re-exports

pub use dtype::Dtype;
pub use dtype::HostDtype;
pub use buffer::CpuBuf;

Modules

buffer
Typed buffer wrappers — Phase B foundation.
cpu
CPU backend using Accelerate on macOS and a portable fallback on Linux. Context = (): all ops execute immediately, so no batching is needed.
dtype
Runtime element type tag for typed device buffers.
timer
Cross-backend GPU-side timer trait — PLAYBOOK § Phase 1.1.

Structs

AttnConfig
Configuration for attention dispatch.
KvBf16
BF16 KV cache (drop-in replacement for FP16 on Ampere+ / Apple Silicon).
KvCache
Per-layer KV cache. Each model owns its own Vec<KvCache<B, K>> per sequence. The K: KvDtypeKind parameter selects the cache element type — defaults to KvFp16 so existing call sites that wrote KvCache<B> keep compiling unchanged.
KvCacheQuant
Quantized-KV cache (Dim 5 INT8 / future FP8 paths). Sibling of KvCache for backends that store K/V in a non-FP16 element type plus per-token per-kv-head scales.
KvFp8
FP8 KV cache — E4M3 by default. Hopper+ on CUDA, future on Metal.
KvFp16
FP16 KV cache (the existing default on CUDA + Metal).
KvInt8
INT8 KV cache — half the memory of FP16 with per-token / per-channel scale factors. CUDA path planned via vLLM’s quant_kv kernels.
MoeRouting
Routing buffers consumed by moe_gemm_phase_vllm, held by the caller across phase 1 and phase 3 of one MoE forward. All three fields hold i32 device tensors even though they are typed as Self::Buffer (fp16 on CUDA); the backend reinterprets the underlying device pointer.
QuantWeights
Packed quantized weight buffers passed to Backend::gemm_quant.
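The KvCache entry above relies on a defaulted type parameter so that older `KvCache<B>` call sites keep compiling. The sketch below shows that mechanism in isolation. The names `KvKind`, `Fp16`, `Int8`, `ToyKvCache`, and `ToyCuda` are hypothetical stand-ins, not the crate's actual definitions, and the byte counts ignore the per-token scale storage that a real quantized cache would also need.

```rust
use std::marker::PhantomData;

/// Hypothetical marker trait playing the role of `KvDtypeKind`.
trait KvKind {
    const NAME: &'static str;
    const BYTES_PER_ELEM: usize;
}

struct Fp16;
struct Int8;

impl KvKind for Fp16 {
    const NAME: &'static str = "fp16";
    const BYTES_PER_ELEM: usize = 2;
}

impl KvKind for Int8 {
    const NAME: &'static str = "int8";
    const BYTES_PER_ELEM: usize = 1;
}

/// Toy cache: `K` defaults to `Fp16`, so `ToyKvCache<B>` is still a valid type.
struct ToyKvCache<B, K: KvKind = Fp16> {
    num_elems: usize,
    _marker: PhantomData<(B, K)>,
}

impl<B, K: KvKind> ToyKvCache<B, K> {
    fn new(num_elems: usize) -> Self {
        Self { num_elems, _marker: PhantomData }
    }

    /// Payload size only; scale factors are deliberately left out.
    fn size_bytes(&self) -> usize {
        self.num_elems * K::BYTES_PER_ELEM
    }

    fn dtype_name(&self) -> &'static str {
        K::NAME
    }
}

/// Placeholder backend type for the example.
struct ToyCuda;
```

Writing `let cache: ToyKvCache<ToyCuda> = ToyKvCache::new(1024);` picks up the `Fp16` default from the type annotation, while `ToyKvCache<ToyCuda, Int8>` opts into the smaller element type explicitly.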

Enums

GgufQuantType
GGUF quantization sub-type (expand as kernels are added).
QuantKind
Quantization flavour discriminator for Backend::gemm_quant.
ReduceOp
Collective-op reduction kind for TP all_reduce.
SrcDtype
Source dtype for a weight tensor read straight from safetensors mmap.

Constants

MAX_LAYERS_FOR_GRAPH
Maximum decode-graph layer count. Per-layer call sites that share graph-captured host staging arrays use this as the stride between distinct slots. This is a CUDA-only invariant (other backends ignore the slot argument). The value 64 covers the current LLM families that fit on a single GPU; Llama-3-70B has 80 layers and exceeds it, but 70B doesn't run on a single 4090 anyway, so 64 is safe in practice for v0.2.
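One plausible reading of "stride between distinct slots" is sketched below: each call site gets a block of 64 slots, and the layer index selects a slot inside that block. The helper `staging_slot` and its `call_site` parameter are hypothetical; the page does not show how the crate actually computes slot indices. The constant is redeclared locally with the value 64 stated in the description.

```rust
/// Local copy for the sketch; the value 64 comes from the description above.
const MAX_LAYERS_FOR_GRAPH: usize = 64;

/// Hypothetical helper: map (call site, layer) to a unique staging slot.
/// Returns None when the layer index would spill into the next call site's block.
fn staging_slot(call_site: usize, layer: usize) -> Option<usize> {
    if layer >= MAX_LAYERS_FOR_GRAPH {
        return None;
    }
    Some(call_site * MAX_LAYERS_FOR_GRAPH + layer)
}
```

Under this scheme the last layer of an 80-layer model (index 79) is rejected instead of silently aliasing another call site's slot, which is the failure mode the layer-count cap guards against.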

Traits

Backend
The core abstraction over CUDA / Metal / CPU.
BackendCollective
Capability-trait for backends that support multi-rank collective ops. Single-GPU backends inherit the no-op defaults: world_size = 1, rank = 0, and the collective ops are identity. Multi-rank backends (CUDA + NCCL today, AMD + RCCL in the future) override these.
BackendGraph
Capability-trait for backends that can capture and replay execution as a graph (CUDA Graph). Models that call these methods bound their generic parameter with B: BackendGraph; backends without graph support (Metal, CPU) implement this trait with an empty body and inherit no-op / unsupported defaults.
BackendInt8KvOps
INT8 KV cache operations (Dim 5).
BackendKvDtype
Capability-trait for backends that can store + read a KV cache of type K.
BackendMoeFused
Capability-trait for backends that natively dispatch MoE post-ops + routing.
BackendPagedKv
Capability-trait for backends that support paged KV cache + paged attention.
BackendQuantGguf
Capability-trait for backends that natively dispatch GGUF k-quant GEMM / GEMV. Metal wires its q4k/q6k shaders here; CUDA/CPU inherit defaults that error.
BackendQuantMarlin
Capability-trait for backends that natively support Marlin INT4 GEMM. CUDA wires this to the Marlin (or vLLM marlin_moe_wna16) tile kernels; other backends inherit defaults that error or no-op.
KvDtypeKind
Marker trait + metadata for a KV cache element type.
KvLayer
Per-K-dtype dispatch trait.
LlmBackend
Minimum capability set for a decoder-only LLM: the core compute trait plus paged-KV cache + graph-capture support. Every concrete backend (CUDA / Metal / CPU) satisfies this.
MoeLlmBackend
MoE-capable LLM backend: adds the fused MoE routing + post-op kernels to the quant LLM bundle. Required by Qwen3-MoE / future MoE models.
QuantLlmBackend
LLM backend that also supports quantized weight loading (GPTQ Marlin for CUDA; GGUF k-quant for Metal). Required by models that hold Box<dyn Linear<B>> where the Linear impl might be a quant variant.
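The capability traits above share one pattern: default method bodies that are a no-op, an identity, or an error, so a backend can opt in with an empty `impl` and override only what it supports. Bundle traits such as LlmBackend then collect several capabilities behind one bound. The sketch below illustrates both ideas with hypothetical names (`Core`, `Collective`, `QuantGguf`, `SketchLlmBackend`, `SketchCpu`, `SketchMetal`). In particular, the blanket impl used for the bundle is an assumption; the page does not say how the real bundle traits are implemented.

```rust
/// Hypothetical core trait standing in for `Backend`.
trait Core {
    fn name(&self) -> &'static str;
}

/// Capability trait with single-rank defaults, in the spirit of `BackendCollective`.
trait Collective: Core {
    fn world_size(&self) -> usize {
        1
    }
    fn rank(&self) -> usize {
        0
    }
    /// Identity on a single rank: the buffer is left untouched.
    fn all_reduce_sum(&self, _buf: &mut [f32]) {}
}

/// Capability trait with an erroring default, in the spirit of `BackendQuantGguf`.
trait QuantGguf: Core {
    fn gemv_q4k(&self) -> Result<(), String> {
        Err(format!("{}: GGUF k-quant not supported", self.name()))
    }
}

struct SketchCpu;
impl Core for SketchCpu {
    fn name(&self) -> &'static str {
        "cpu"
    }
}
impl Collective for SketchCpu {} // empty body, inherits the defaults
impl QuantGguf for SketchCpu {} // inherits the erroring default

struct SketchMetal;
impl Core for SketchMetal {
    fn name(&self) -> &'static str {
        "metal"
    }
}
impl Collective for SketchMetal {}
impl QuantGguf for SketchMetal {
    /// Overrides the default; a real backend would dispatch a shader here.
    fn gemv_q4k(&self) -> Result<(), String> {
        Ok(())
    }
}

/// Bundle trait: one bound instead of three at every call site.
trait SketchLlmBackend: Core + Collective + QuantGguf {}
impl<T: Core + Collective + QuantGguf> SketchLlmBackend for T {}

fn describe<B: SketchLlmBackend>(b: &B) -> String {
    format!(
        "{} rank {}/{} gguf={}",
        b.name(),
        b.rank(),
        b.world_size(),
        b.gemv_q4k().is_ok()
    )
}
```

Model code bounded on the bundle compiles for every backend, and unsupported operations surface as runtime errors from the inherited defaults instead of as missing trait impls.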