Module marlin_expert_stack

MarlinExpertStack<B>: an abstraction for N MoE experts' Marlin GPTQ-INT4 tiles, stored contiguously and dispatched either as a bucketed batched GEMM or through the vLLM fused MoE kernel.

This is the Phase C sibling of StackedExpertGgufLinear<B> (GGUF) and Linear<B> (single-tensor). The goal is the same: drop type GptqStore from the Backend trait by routing dispatch through a Box<dyn MarlinExpertStack<B>> returned by the loader, so that future backends only need to implement this trait rather than edit the Backend supertrait stack.
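The routing described above can be illustrated with a minimal sketch. Everything here except the names Backend and MarlinExpertStack is hypothetical (CpuToy, ToyStack, load_stack, num_experts, describe), and the trait is reduced to two placeholder methods; the point is only that the loader hands back a boxed trait object, so the backend trait needs no store-specific associated type.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for the real `Backend` trait. Note that it carries
/// no `type GptqStore` associated type.
trait Backend {
    fn name() -> &'static str;
}

/// Hypothetical toy backend used only for this sketch.
struct CpuToy;

impl Backend for CpuToy {
    fn name() -> &'static str {
        "cpu-toy"
    }
}

/// Heavily reduced placeholder for the real trait: the methods below are
/// assumptions, not the actual trait surface.
trait MarlinExpertStack<B: Backend> {
    fn num_experts(&self) -> usize;
    fn describe(&self) -> String;
}

/// Hypothetical concrete stack; a real impl would own the weight storage.
struct ToyStack<B: Backend> {
    num_experts: usize,
    _backend: PhantomData<B>,
}

impl<B: Backend> MarlinExpertStack<B> for ToyStack<B> {
    fn num_experts(&self) -> usize {
        self.num_experts
    }

    fn describe(&self) -> String {
        format!("{} experts on {}", self.num_experts, B::name())
    }
}

/// Hypothetical loader entry point: callers only ever see the trait object,
/// so adding a backend means adding an impl, not editing `Backend`.
fn load_stack<B: Backend + 'static>(num_experts: usize) -> Box<dyn MarlinExpertStack<B>> {
    Box::new(ToyStack::<B> {
        num_experts,
        _backend: PhantomData,
    })
}
```

A caller would write something like `let stack = load_stack::<CpuToy>(8);` and then invoke methods on `stack` without knowing the concrete type behind the box.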

Concrete impls (added in Phase C step 2):

  • quant_linear::cuda_marlin_stack::CudaMarlinExpertStack wraps Arc<GptqStoreCuda> and dispatches to marlin_gemm_with_offset_strided (bucketed) or marlin_moe_wna16 (vLLM fused).
  • The CPU dequant path stays per-Linear, because there is no batched MoE Marlin kernel for it.

The trait surface is intentionally small: three GEMM methods, a workspace-zeroing method, and an expert-view constructor. Each maps 1:1 to an existing Backend::moe_gemm_phase_* method that Phase C step 3 will delete from the trait.

Traits

MarlinExpertStack
MoE-stacked Marlin INT4 expert tile: holds N experts' weights for one matmul role (gate_up / down) in one contiguous repacked Marlin buffer, and dispatches per-expert column-slice GEMMs either in a single fused launch (vLLM marlin_moe_wna16) or as a bucketed batched call.
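To make the contiguous layout and the bucketed call concrete, here is a toy sketch in plain f32 on the CPU. It is not the Marlin INT4 format and not the CUDA kernels named above; ToyExpertStack, expert_view, and bucketed_gemm are hypothetical names, and the row-major [k, n] layout per expert is an assumption made for illustration.

```rust
/// Toy f32 stand-in for a stacked expert buffer: `num_experts` row-major
/// matrices of shape [k, n], stored back to back in one Vec.
struct ToyExpertStack {
    num_experts: usize,
    k: usize,
    n: usize,
    weights: Vec<f32>,
}

impl ToyExpertStack {
    fn new(num_experts: usize, k: usize, n: usize, weights: Vec<f32>) -> Self {
        assert_eq!(weights.len(), num_experts * k * n, "buffer must hold every expert");
        Self { num_experts, k, n, weights }
    }

    /// Borrow one expert's [k, n] matrix as a slice of the shared buffer.
    fn expert_view(&self, expert: usize) -> &[f32] {
        assert!(expert < self.num_experts, "expert index out of range");
        let len = self.k * self.n;
        &self.weights[expert * len..(expert + 1) * len]
    }

    /// Bucketed dispatch: group token rows by their routed expert, then run
    /// one small GEMM per non-empty bucket. `x` is [tokens, k] row-major and
    /// the result is [tokens, n] row-major.
    fn bucketed_gemm(&self, x: &[f32], expert_ids: &[usize]) -> Vec<f32> {
        let tokens = expert_ids.len();
        assert_eq!(x.len(), tokens * self.k, "x must be [tokens, k]");

        // Bucket token indices by expert.
        let mut buckets: Vec<Vec<usize>> = vec![Vec::new(); self.num_experts];
        for (t, &e) in expert_ids.iter().enumerate() {
            buckets[e].push(t);
        }

        let mut out = vec![0.0f32; tokens * self.n];
        for (e, bucket) in buckets.iter().enumerate() {
            if bucket.is_empty() {
                continue; // no tokens routed to this expert
            }
            let w = self.expert_view(e);
            for &t in bucket {
                for j in 0..self.n {
                    let mut acc = 0.0f32;
                    for i in 0..self.k {
                        acc += x[t * self.k + i] * w[i * self.n + j];
                    }
                    out[t * self.n + j] = acc;
                }
            }
        }
        out
    }
}
```

For example, with two 2x2 experts (the identity and twice the identity) and three tokens routed to experts 1, 0, 1, each output row is the matching input row scaled by its expert. A fused kernel would compute the same result in a single launch instead of one call per bucket.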