pub trait BackendQuantMarlin: Backend {
    // Provided methods
    fn load_gptq(
        _qweight: &[i32],
        _scales: &[f32],
        _qzeros: &[i32],
        _g_idx: Option<&[i32]>,
        _bias_host: Option<&[f32]>,
        _bits: u32,
        _group_size: usize,
        _k: usize,
        _n: usize,
    ) -> Result<Box<dyn Linear<Self> + Send + Sync>> { ... }
    fn load_gptq_stacked(
        _qweights: &[&[i32]],
        _scales: &[&[f32]],
        _qzeros: &[&[i32]],
        _g_idx: Option<&[i32]>,
        _bits: u32,
        _group_size: usize,
        _k: usize,
        _n_per_expert: usize,
    ) -> Result<Arc<dyn MarlinExpertStack<Self>>> { ... }
    fn pregrow_marlin_gather_scratch(_ctx: &mut Self::Context, _required: usize) { ... }
}
Capability trait for backends that natively support Marlin INT4 GEMM. The CUDA backend wires this to the Marlin (or vLLM marlin_moe_wna16) tile kernels; other backends inherit defaults that return an error or do nothing.
Provided Methods
fn load_gptq(
    _qweight: &[i32],
    _scales: &[f32],
    _qzeros: &[i32],
    _g_idx: Option<&[i32]>,
    _bias_host: Option<&[f32]>,
    _bits: u32,
    _group_size: usize,
    _k: usize,
    _n: usize,
) -> Result<Box<dyn Linear<Self> + Send + Sync>>
Repacks raw GPTQ tensors into a backend-specific Linear<Self> implementation.
Called once per layer at model load time.
Inputs are host-side slices (CPU memory): the loader reads them from
safetensors and hands them off, and each backend uploads and repacks
them according to its own strategy. bits is typically 4 and group_size
is typically 128. bias_host is an optional f32 vector of shape
[out_features], present when the model has a fused bias (e.g. Qwen2.5
attention projections).
Phase 3e/2: returns Box<dyn Linear<Self>> directly (CUDA:
CudaMarlinLinear; CPU: CpuGptqLinear). Kernel dispatch lives
inside the boxed Linear’s forward; the old gemm_gptq trait
method has been removed.
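A caller might sanity-check the host-side slices before handing them to load_gptq. The sketch below is illustrative only: it assumes the common AutoGPTQ-style layout (qweight as [k / pack, n] packed i32 words, scales as [k / group_size, n], qzeros as [k / group_size, n / pack], with pack = 32 / bits), which this page does not confirm. GptqShapes, expected_gptq_shapes and check_gptq_inputs are hypothetical names, not part of the crate.

```rust
/// Expected element counts for the GPTQ tensors.
/// ASSUMPTION: AutoGPTQ-style packing; the real backend layout may differ.
#[derive(Debug, PartialEq, Eq)]
pub struct GptqShapes {
    pub qweight_len: usize, // (k / pack) * n packed i32 words
    pub scales_len: usize,  // (k / group_size) * n
    pub qzeros_len: usize,  // (k / group_size) * (n / pack) packed i32 words
}

/// Hypothetical helper: derive the expected slice lengths from the scalars.
pub fn expected_gptq_shapes(
    bits: u32,
    group_size: usize,
    k: usize,
    n: usize,
) -> Result<GptqShapes, String> {
    if bits == 0 || 32 % bits != 0 {
        return Err(format!("bits={bits} does not evenly pack into an i32"));
    }
    let pack = (32 / bits) as usize; // values per i32 word (8 when bits == 4)
    if group_size == 0 || k % group_size != 0 {
        return Err(format!("k={k} is not a multiple of group_size={group_size}"));
    }
    if k % pack != 0 || n % pack != 0 {
        return Err(format!("k={k} and n={n} must be multiples of {pack}"));
    }
    let groups = k / group_size;
    Ok(GptqShapes {
        qweight_len: (k / pack) * n,
        scales_len: groups * n,
        qzeros_len: groups * (n / pack),
    })
}

/// Hypothetical helper: compare the actual slices against the expected lengths.
pub fn check_gptq_inputs(
    qweight: &[i32],
    scales: &[f32],
    qzeros: &[i32],
    bias_host: Option<&[f32]>,
    bits: u32,
    group_size: usize,
    k: usize,
    n: usize,
) -> Result<(), String> {
    let want = expected_gptq_shapes(bits, group_size, k, n)?;
    if qweight.len() != want.qweight_len {
        return Err(format!("qweight: got {}, want {}", qweight.len(), want.qweight_len));
    }
    if scales.len() != want.scales_len {
        return Err(format!("scales: got {}, want {}", scales.len(), want.scales_len));
    }
    if qzeros.len() != want.qzeros_len {
        return Err(format!("qzeros: got {}, want {}", qzeros.len(), want.qzeros_len));
    }
    // The bias, when present, has one entry per output feature.
    if let Some(b) = bias_host {
        if b.len() != n {
            return Err(format!("bias_host: got {}, want {}", b.len(), n));
        }
    }
    Ok(())
}
```

Running a check like this on the host keeps shape mistakes from surfacing later as hard-to-read kernel failures; whether the real loader performs an equivalent check is not stated here.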
fn load_gptq_stacked(
    _qweights: &[&[i32]],
    _scales: &[&[f32]],
    _qzeros: &[&[i32]],
    _g_idx: Option<&[i32]>,
    _bits: u32,
    _group_size: usize,
    _k: usize,
    _n_per_expert: usize,
) -> Result<Arc<dyn MarlinExpertStack<Self>>>
Loads num_experts GPTQ weight tiles into a single stacked store in which each expert’s packed bytes are contiguous. The offset GEMM relies on this property to dispatch per expert via a pointer offset alone.
Why this is a separate API rather than load_gptq followed by a post-hoc concat:
Marlin’s repack permutes data in [K-tile-row outer, N-tile inner]
order. A single repack of concat(all experts along N) therefore
produces a buffer in which expert e’s bytes are spread across
K-tile-rows and are NOT contiguous. Repacking each expert first and
concatenating afterwards keeps each expert’s data in one contiguous
block.
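The ordering argument above can be checked with a toy model. The sketch below is not Marlin’s actual repack: toy_repack is a hypothetical stand-in that only reproduces the [K-tile-row outer, N-tile inner] traversal order, and every element is labeled with its expert id so the resulting layout is easy to inspect.

```rust
/// Toy tile repack (NOT the real Marlin permutation): walks K-tile-rows
/// outermost and N-tiles innermost, emitting each tile row by row.
fn toy_repack(src: &[u32], k: usize, n: usize, tk: usize, tn: usize) -> Vec<u32> {
    assert_eq!(src.len(), k * n);
    assert!(k % tk == 0 && n % tn == 0);
    let mut out = Vec::with_capacity(src.len());
    for kt in 0..k / tk {
        for nt in 0..n / tn {
            for r in 0..tk {
                for c in 0..tn {
                    out.push(src[(kt * tk + r) * n + nt * tn + c]);
                }
            }
        }
    }
    out
}

/// Concatenate row-major [k, n] expert matrices along N into [k, e * n].
fn concat_along_n(experts: &[Vec<u32>], k: usize, n: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(experts.len() * k * n);
    for row in 0..k {
        for e in experts {
            out.extend_from_slice(&e[row * n..(row + 1) * n]);
        }
    }
    out
}

/// True if every distinct id occupies exactly one contiguous run.
fn is_contiguous_per_expert(ids: &[u32]) -> bool {
    let mut seen: Vec<u32> = Vec::new();
    let mut prev: Option<u32> = None;
    for &id in ids {
        if Some(id) != prev {
            if seen.contains(&id) {
                return false; // this id already had an earlier run
            }
            seen.push(id);
            prev = Some(id);
        }
    }
    true
}

/// Returns (single-repack contiguous?, per-expert-repack contiguous?)
/// for two experts with K = 4, N = 2 and 2x2 tiles.
fn demo() -> (bool, bool) {
    let (k, n, tk, tn) = (4, 2, 2, 2);
    // Every element of expert e carries the label e.
    let experts: Vec<Vec<u32>> = (0..2u32).map(|e| vec![e; k * n]).collect();

    // Strategy A: concat along N, then one repack of the whole matrix.
    let single = toy_repack(&concat_along_n(&experts, k, n), k, n * experts.len(), tk, tn);

    // Strategy B: repack each expert on its own, then concatenate.
    let per_expert: Vec<u32> = experts
        .iter()
        .flat_map(|e| toy_repack(e, k, n, tk, tn))
        .collect();

    (is_contiguous_per_expert(&single), is_contiguous_per_expert(&per_expert))
}
```

In this toy setup the single repack interleaves the two experts once per K-tile-row, while the per-expert variant leaves each expert in one block that a base pointer plus a fixed stride can address.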
qweights[i], scales[i] and qzeros[i] are expert i’s raw GPTQ
tensors. All experts share the same K, group_size, bits and g_idx.
The default returns Err(unsupported); override it on backends that
have a per-expert MoE GPTQ path.
Phase C step 4e: returns the trait object MarlinExpertStack
directly. Internally, each backend constructs its own opaque
repacked tile (Marlin: per-expert-then-concat; CPU: dequantized
f32 weight slab) and wraps it in the concrete
{Cuda,Cpu}MarlinExpertStack impl.
Removing Self::GptqStore from the public API eliminates the type
leak that previously forced ExpertStack<B> to carry
Option<Arc<B::GptqStore>>. Adding a new Marlin backend now
requires only an implementation of this method plus a fresh
MarlinExpertStack<NewBackend> impl, with no edits to the Backend trait.
fn pregrow_marlin_gather_scratch(_ctx: &mut Self::Context, _required: usize)
Pre-grows any backend-internal scratch slots whose size depends
on m_total * intermediate_size (the largest matmul fan-in
inside unified_forward_internal). The default is a no-op. CUDA
implements this to grow the perm-aware Marlin gather scratch
eagerly, before the caller enters a CUDA-graph capture region:
a cuLaunchKernel issued after a runtime allocation inside a
captured stream returns CUDA_ERROR_INVALID_VALUE.
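The grow-before-capture pattern can be sketched on the host side as follows. This is a minimal illustration, not the CUDA backend’s implementation: GatherScratch and required_elems are hypothetical names, a Vec<f32> stands in for device memory, and treating the required size as an element count (rather than bytes) is an assumption.

```rust
/// Hypothetical grow-only scratch buffer. A Vec<f32> stands in for the
/// device allocation that a real backend would hold.
pub struct GatherScratch {
    buf: Vec<f32>,
}

impl GatherScratch {
    pub fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Grow to at least `required` elements. Call this BEFORE entering the
    /// capture region. Returns true if an allocation actually happened.
    pub fn pregrow(&mut self, required: usize) -> bool {
        if required > self.buf.len() {
            self.buf.resize(required, 0.0);
            true
        } else {
            false // already large enough; never shrinks
        }
    }

    /// Number of elements currently available.
    pub fn capacity_elems(&self) -> usize {
        self.buf.len()
    }

    /// Borrow `len` elements without ever allocating. Returns None when the
    /// buffer is too small, so a caller inside a capture region fails
    /// loudly instead of triggering an allocation.
    pub fn borrow(&mut self, len: usize) -> Option<&mut [f32]> {
        self.buf.get_mut(..len)
    }
}

/// Hypothetical sizing helper: m_total * intermediate_size with an
/// overflow check. ASSUMPTION: the unit is elements, not bytes.
pub fn required_elems(m_total: usize, intermediate_size: usize) -> Option<usize> {
    m_total.checked_mul(intermediate_size)
}
```

The design choice illustrated here is that the hot path (borrow) has no allocating branch at all, so any sizing mistake shows up as a None before capture-sensitive work begins.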
Dyn Compatibility
This trait is not dyn compatible.
In older versions of Rust, dyn compatibility was called "object safety".