MarlinExpertStack<B> — abstraction for “N MoE experts’ Marlin
GPTQ-INT4 tiles stored contiguously, dispatched as bucketed batched
GEMM or vLLM fused MoE kernel”.
Phase C sibling to StackedExpertGgufLinear<B> (GGUF) and Linear<B>
(single-tensor). Same goal: drop type GptqStore from the Backend
trait by routing dispatch through a Box<dyn MarlinExpertStack<B>>
returned by the loader — so future backends only need to implement
this trait, not edit the Backend supertrait stack.
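The routing idea above can be sketched in a few lines. This is a minimal illustration, not the crate's actual code: the reduced Backend trait, the two accessor methods, ToyBackend, ToyStack, and load_expert_stack are all hypothetical names introduced here. The point it shows is that the backend trait carries no GptqStore associated type, because the loader hands back a boxed trait object instead.

```rust
/// Hypothetical, reduced stand-in for the real Backend trait.
/// Note that it declares no `type GptqStore`.
trait Backend {
    type Tensor;
}

/// Hypothetical, reduced version of the stack trait; the two methods
/// here are placeholders, not the real trait surface.
trait MarlinExpertStack<B: Backend> {
    fn num_experts(&self) -> usize;
    fn kernel_name(&self) -> &'static str;
}

/// Toy backend used only for this sketch.
struct ToyBackend;

impl Backend for ToyBackend {
    type Tensor = Vec<f32>;
}

/// Toy stack: a real impl would wrap backend-specific storage.
struct ToyStack {
    experts: usize,
}

impl MarlinExpertStack<ToyBackend> for ToyStack {
    fn num_experts(&self) -> usize {
        self.experts
    }

    fn kernel_name(&self) -> &'static str {
        "toy_bucketed"
    }
}

/// Hypothetical loader: callers see only the trait object, so a new
/// backend adds a new impl without touching the Backend trait.
fn load_expert_stack(experts: usize) -> Box<dyn MarlinExpertStack<ToyBackend>> {
    Box::new(ToyStack { experts })
}
```

Callers hold the returned box and invoke trait methods on it; the concrete storage type never appears in their signatures.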
Concrete impls (added in Phase C step 2):
- quant_linear::cuda_marlin_stack::CudaMarlinExpertStack wraps Arc<GptqStoreCuda> and dispatches to marlin_gemm_with_offset_strided (bucketed) or marlin_moe_wna16 (vLLM fused).
- CPU dequant path stays per-Linear (no batched MoE Marlin kernel).
The trait surface is intentionally small: three GEMM methods, a
workspace-zeroing method, and an expert-view constructor. Each maps
1:1 to an existing Backend::moe_gemm_phase_* method that Phase C
step 3 will delete from the trait.
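A rough sketch of a trait with that shape follows. Everything in it is an assumption made for illustration: the method names (gemm_phase_a/b/c, zero_workspace, expert_view), their signatures, and the dense f32 storage are hypothetical and stand in for the real Marlin INT4 tiles. This page does not say what each phase computes, so the toy routes all three GEMM methods to the same matrix-vector product.

```rust
/// Hypothetical borrowed view of one expert's row-major [k x n] weights.
struct ExpertView<'a> {
    weights: &'a [f32],
    k: usize,
    n: usize,
}

impl ExpertView<'_> {
    /// Computes out[j] = sum over i of x[i] * w[i][j].
    fn matvec(&self, x: &[f32], out: &mut [f32]) {
        assert_eq!(x.len(), self.k);
        assert_eq!(out.len(), self.n);
        for (j, o) in out.iter_mut().enumerate() {
            *o = (0..self.k).map(|i| x[i] * self.weights[i * self.n + j]).sum();
        }
    }
}

/// Hypothetical trait: three GEMM entry points, a workspace reset,
/// and an expert-view constructor. Names and signatures are assumed.
trait ExpertStackSketch {
    fn gemm_phase_a(&self, expert: usize, x: &[f32], out: &mut [f32]);
    fn gemm_phase_b(&self, expert: usize, x: &[f32], out: &mut [f32]);
    fn gemm_phase_c(&self, expert: usize, x: &[f32], out: &mut [f32]);
    fn zero_workspace(&mut self);
    fn expert_view(&self, expert: usize) -> ExpertView<'_>;
}

/// Toy stack: all experts' weights live in one contiguous Vec.
struct ToyStack {
    weights: Vec<f32>,   // length = experts * k * n
    workspace: Vec<f32>, // scratch buffer, reset between launches
    k: usize,
    n: usize,
}

impl ExpertStackSketch for ToyStack {
    fn gemm_phase_a(&self, expert: usize, x: &[f32], out: &mut [f32]) {
        self.expert_view(expert).matvec(x, out)
    }

    fn gemm_phase_b(&self, expert: usize, x: &[f32], out: &mut [f32]) {
        self.expert_view(expert).matvec(x, out)
    }

    fn gemm_phase_c(&self, expert: usize, x: &[f32], out: &mut [f32]) {
        self.expert_view(expert).matvec(x, out)
    }

    fn zero_workspace(&mut self) {
        self.workspace.fill(0.0);
    }

    /// Slices the contiguous buffer at the expert's offset.
    fn expert_view(&self, expert: usize) -> ExpertView<'_> {
        let len = self.k * self.n;
        let start = expert * len;
        ExpertView { weights: &self.weights[start..start + len], k: self.k, n: self.n }
    }
}
```

Keeping the per-expert view as a borrowed slice means the stack owns a single allocation and selecting an expert costs only an offset computation.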
Traits
- MarlinExpertStack - MoE-stacked Marlin INT4 expert tile: holds N experts’ weights for one matmul role (gate_up / down) in one contiguous repacked Marlin buffer, dispatches per-expert column-slice GEMMs in a single fused launch (vLLM marlin_moe_wna16) or as a bucketed batched call.