Module backend

Unified Backend trait for CUDA, Metal, and CPU compute.

Each backend implements the same set of transformer-layer primitives (GEMM, norms, RoPE, attention, activations). layer_forward() and ModelRunner are generic over Backend, so one forward path serves all hardware targets.
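The "one forward path" idea can be sketched as follows. This is a minimal illustration, not the crate's actual API: the trait here is a heavily reduced stand-in, and `ToyCpu`, `toy_layer_forward`, `upload`, `download`, and the method signatures are all hypothetical names chosen for the example.

```rust
/// Hypothetical, heavily reduced stand-in for the crate's `Backend` trait.
/// The real trait covers many more primitives (GEMM, RoPE, attention, ...).
trait Backend {
    /// Device-side buffer type owned by the backend.
    type Buffer;

    fn upload(&self, data: &[f32]) -> Self::Buffer;
    fn download(&self, buf: &Self::Buffer) -> Vec<f32>;

    /// RMS normalisation over the whole buffer (no learned weight, for brevity).
    fn rms_norm(&self, x: &Self::Buffer, eps: f32) -> Self::Buffer;

    /// Elementwise SiLU activation: x * sigmoid(x).
    fn silu(&self, x: &Self::Buffer) -> Self::Buffer;
}

/// Toy CPU backend: buffers are plain vectors and every op runs immediately.
struct ToyCpu;

impl Backend for ToyCpu {
    type Buffer = Vec<f32>;

    fn upload(&self, data: &[f32]) -> Vec<f32> {
        data.to_vec()
    }

    fn download(&self, buf: &Vec<f32>) -> Vec<f32> {
        buf.clone()
    }

    fn rms_norm(&self, x: &Vec<f32>, eps: f32) -> Vec<f32> {
        let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
        let inv = 1.0 / (mean_sq + eps).sqrt();
        x.iter().map(|v| v * inv).collect()
    }

    fn silu(&self, x: &Vec<f32>) -> Vec<f32> {
        x.iter().map(|v| v / (1.0 + (-v).exp())).collect()
    }
}

/// Written once against the trait; monomorphised per backend by the compiler.
fn toy_layer_forward<B: Backend>(backend: &B, input: &[f32]) -> Vec<f32> {
    let x = backend.upload(input);
    let normed = backend.rms_norm(&x, 1e-6);
    let activated = backend.silu(&normed);
    backend.download(&activated)
}
```

Calling `toy_layer_forward(&ToyCpu, &[1.0, -1.0])` runs the same code path that a GPU-backed implementor of the trait would use; only the `Buffer` type and the primitive implementations differ.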

Re-exports§

pub use dtype::Dtype;
pub use dtype::HostDtype;
pub use buffer::CpuBuf;

Modules§

buffer: Typed buffer wrappers — Phase B foundation.
cpu: CPU backend using Accelerate (macOS) / portable fallback (Linux). Context = () — all ops execute immediately, no batching needed.
dtype: Runtime element type tag for typed device buffers.
timer: Cross-backend GPU-side timer trait — PLAYBOOK § Phase 1.1.

Structs§

AttnConfig: Configuration for attention dispatch.
KvBf16: BF16 KV cache (drop-in replacement for FP16 on Ampere+ / Apple Silicon).
KvCache: Per-layer KV cache. Each model owns its own Vec<KvCache<B, K>> per sequence. The K: KvDtypeKind parameter selects the cache element type — defaults to KvFp16 so existing call sites that wrote KvCache<B> keep compiling unchanged.
KvCacheQuant: Quantized-KV cache (Dim 5 INT8 / future FP8 paths). Sibling of KvCache for backends that store K/V in a non-FP16 element type plus per-token per-kv-head scales.
KvFp8: FP8 KV cache — E4M3 by default. Hopper+ on CUDA, future on Metal.
KvFp16: FP16 KV cache (the existing default on CUDA + Metal).
KvInt8: INT8 KV cache — half the memory of FP16 with per-token / per-channel scale factors. CUDA path planned via vLLM’s quant_kv kernels.
MoeRouting: Routing buffers consumed by moe_gemm_phase_vllm, held by the caller across phase 1 and phase 3 of one MoE forward. All three fields hold i32 data but are typed as Self::Buffer (fp16 on CUDA); the backend reinterprets the underlying device pointer as i32.
QuantWeights: Packed quantized weight buffers passed to Backend::gemm_quant.
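The defaulted `K` parameter on `KvCache` can be illustrated with a small sketch. Only the names `KvCache`, `KvDtypeKind`, `KvFp16`, and `KvInt8` come from the listing above; the fields, the associated constants, `ToyBackend`, and the methods are assumptions made for the example, and the byte count ignores the per-token scale storage that a real INT8 cache also needs.

```rust
use std::marker::PhantomData;

/// Marker trait carrying metadata for a KV element type.
/// The associated constants are assumptions for this sketch.
trait KvDtypeKind {
    const NAME: &'static str;
    const BYTES_PER_ELEM: usize;
}

struct KvFp16;
struct KvInt8;

impl KvDtypeKind for KvFp16 {
    const NAME: &'static str = "fp16";
    const BYTES_PER_ELEM: usize = 2;
}

impl KvDtypeKind for KvInt8 {
    const NAME: &'static str = "int8";
    const BYTES_PER_ELEM: usize = 1;
}

/// Hypothetical backend placeholder.
struct ToyBackend;

/// `K` defaults to `KvFp16`, so a type written as `KvCache<B>` still compiles.
struct KvCache<B, K: KvDtypeKind = KvFp16> {
    max_tokens: usize,
    kv_dim: usize,
    _marker: PhantomData<(B, K)>,
}

impl<B, K: KvDtypeKind> KvCache<B, K> {
    fn new(max_tokens: usize, kv_dim: usize) -> Self {
        Self { max_tokens, kv_dim, _marker: PhantomData }
    }

    /// Bytes for K and V together, excluding any quantization scales.
    fn bytes(&self) -> usize {
        2 * self.max_tokens * self.kv_dim * K::BYTES_PER_ELEM
    }

    fn dtype_name(&self) -> &'static str {
        K::NAME
    }
}
```

With this shape, `let cache: KvCache<ToyBackend> = KvCache::new(8, 4);` selects FP16 through the default, while `KvCache<ToyBackend, KvInt8>` opts into the smaller element type explicitly.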

Enums§

GgufQuantType: GGUF quantization sub-type (expand as kernels are added).
QuantKind: Quantization flavour discriminator for Backend::gemm_quant.
ReduceOp: Collective-op reduction kind for TP all_reduce.
SrcDtype: Source dtype for a weight tensor read straight from safetensors mmap.

Constants§

MAX_LAYERS_FOR_GRAPH: Maximum decode-graph layer count. Per-layer call sites that share graph-captured host staging arrays use this as the stride between distinct slots. CUDA-only invariant (other backends ignore the slot argument). The value 64 covers the current LLM families that fit on a single GPU. Llama-3-70B, at 80 layers, exceeds it, but 70B doesn’t run on a single 4090 anyway, so 64 is safe in practice for v0.2.
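One way to read "stride between distinct slots" is sketched below. The value 64 is taken from the description; the helper `staging_index`, its parameters, and the row-major layout are assumptions for illustration and may not match how the crate actually indexes its staging arrays.

```rust
/// Value taken from the constant's description.
const MAX_LAYERS_FOR_GRAPH: usize = 64;

/// Hypothetical helper: maps (call site, layer) to a flat staging-array index,
/// using MAX_LAYERS_FOR_GRAPH as the stride between call sites.
/// Returns None when the layer does not fit in a slot or the index overflows.
fn staging_index(call_site: usize, layer: usize) -> Option<usize> {
    if layer >= MAX_LAYERS_FOR_GRAPH {
        // A layer index past the stride would alias the next call site's slots.
        return None;
    }
    call_site
        .checked_mul(MAX_LAYERS_FOR_GRAPH)?
        .checked_add(layer)
}
```

Under this layout, layer 64 and beyond of an 80-layer model would be rejected rather than silently overlapping another call site's slots.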

Traits§

Backend: The core abstraction over CUDA / Metal / CPU.
BackendCollective: Capability-trait for backends that support multi-rank collective ops. Single-GPU backends inherit the no-op defaults: world_size = 1, rank = 0, and the collective ops are identity. Multi-rank backends (CUDA + NCCL today, AMD + RCCL in the future) override these.
BackendGraph: Capability-trait for backends that can capture and replay execution as a graph (CUDA Graph). Models that call these methods bound their generic parameter with B: BackendGraph; backends without graph support (Metal, CPU) implement this trait with an empty body and inherit no-op / unsupported defaults.
BackendInt8KvOps: INT8 KV cache operations (Dim 5).
BackendKvDtype: Capability-trait for backends that can store + read a KV cache of type K.
BackendMoeFused: Capability-trait for backends that natively dispatch MoE post-ops + routing.
BackendPagedKv: Capability-trait for backends that support paged KV cache + paged attention.
BackendQuantGguf: Capability-trait for backends that natively dispatch GGUF k-quant GEMM / GEMV. Metal wires its q4k/q6k shaders here; CUDA/CPU inherit defaults that error.
BackendQuantMarlin: Capability-trait for backends that natively support Marlin INT4 GEMM. CUDA wires this to the Marlin (or vLLM marlin_moe_wna16) tile kernels; other backends inherit defaults that error or no-op.
KvDtypeKind: Marker trait + metadata for a KV cache element type.
KvLayer: Per-K-dtype dispatch trait.
LlmBackend: Minimum capability set for a decoder-only LLM: the core compute trait plus paged-KV cache + graph-capture support. Every concrete backend (CUDA / Metal / CPU) satisfies this.
MoeLlmBackend: MoE-capable LLM backend: adds the fused MoE routing + post-op kernels to the quant LLM bundle. Required by Qwen3-MoE / future MoE models.
QuantLlmBackend: LLM backend that also supports quantized weight loading (GPTQ Marlin for CUDA; GGUF k-quant for Metal). Required by models that hold Box<dyn Linear<B>> where the Linear impl might be a quant variant.
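The capability-trait pattern described in this list (default methods that are no-ops or errors, plus bundle traits that combine capabilities) can be sketched roughly as follows. The trait names mirror the listing, but every method, the blanket impl, and the `ToyCpu` / `ToyCuda` types are assumptions for the example. The paged-KV capability that the real `LlmBackend` also requires is left out to keep the sketch short.

```rust
/// Reduced stand-in for the core trait.
trait Backend {
    fn name(&self) -> &'static str;
}

/// Capability trait: defaults report "unsupported", so an empty impl opts out.
trait BackendGraph: Backend {
    fn supports_graph(&self) -> bool {
        false
    }
    fn begin_capture(&self) -> Result<(), String> {
        Err(format!("{}: graph capture unsupported", self.name()))
    }
}

/// Capability trait: defaults describe a single-rank world with identity ops.
trait BackendCollective: Backend {
    fn world_size(&self) -> usize {
        1
    }
    fn rank(&self) -> usize {
        0
    }
    fn all_reduce_sum(&self, _data: &mut [f32]) {
        // Identity on a single rank: nothing to combine.
    }
}

/// Bundle trait. The blanket impl is one possible way to provide it.
trait LlmBackend: Backend + BackendGraph {}
impl<T: Backend + BackendGraph> LlmBackend for T {}

struct ToyCpu;
impl Backend for ToyCpu {
    fn name(&self) -> &'static str {
        "toy-cpu"
    }
}
impl BackendGraph for ToyCpu {} // empty body: inherits the unsupported defaults
impl BackendCollective for ToyCpu {} // empty body: single-rank defaults

struct ToyCuda;
impl Backend for ToyCuda {
    fn name(&self) -> &'static str {
        "toy-cuda"
    }
}
impl BackendGraph for ToyCuda {
    fn supports_graph(&self) -> bool {
        true
    }
    fn begin_capture(&self) -> Result<(), String> {
        Ok(())
    }
}

/// Model code bounds on the bundle instead of listing each capability.
fn describe<B: LlmBackend>(backend: &B) -> String {
    format!("{} graph={}", backend.name(), backend.supports_graph())
}
```

The design choice this illustrates is that model code can name a single bound such as `B: LlmBackend`, while a backend that lacks a capability still satisfies the bound by inheriting defaults that fail or do nothing at run time.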
