Trait BackendQuantMarlin 

pub trait BackendQuantMarlin: Backend {
    // Provided methods
    fn load_gptq(
        _qweight: &[i32],
        _scales: &[f32],
        _qzeros: &[i32],
        _g_idx: Option<&[i32]>,
        _bias_host: Option<&[f32]>,
        _bits: u32,
        _group_size: usize,
        _k: usize,
        _n: usize,
    ) -> Result<Box<dyn Linear<Self> + Send + Sync>> { ... }
    fn load_gptq_stacked(
        _qweights: &[&[i32]],
        _scales: &[&[f32]],
        _qzeros: &[&[i32]],
        _g_idx: Option<&[i32]>,
        _bits: u32,
        _group_size: usize,
        _k: usize,
        _n_per_expert: usize,
    ) -> Result<Arc<dyn MarlinExpertStack<Self>>> { ... }
    fn pregrow_marlin_gather_scratch(_ctx: &mut Self::Context, _required: usize) { ... }
}

Capability trait for backends that natively support Marlin INT4 GEMM. CUDA wires this to the Marlin (or vLLM marlin_moe_wna16) tile kernels; other backends inherit defaults that either return an error or do nothing.
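
The capability-trait pattern described above can be sketched in isolation. This is a minimal illustration, not the crate's actual code: `Backend` here is a stand-in with a single hypothetical `NAME` constant, and `QuantCapability`, `Unsupported`, `NoQuantBackend`, and `TileBackend` are invented names. It only shows how provided methods let a backend opt in by overriding, while every other backend inherits an error or a no-op.

```rust
use std::fmt;

// Hypothetical error type; the real crate's `Result` alias is not shown in these docs.
#[derive(Debug)]
pub struct Unsupported(pub &'static str);

impl fmt::Display for Unsupported {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "backend `{}` does not support this operation", self.0)
    }
}

impl std::error::Error for Unsupported {}

// Stand-in for the real `Backend` supertrait (assumption: only a name is needed here).
pub trait Backend {
    const NAME: &'static str;
}

// Capability trait: every method is provided, so an empty impl is valid.
pub trait QuantCapability: Backend {
    // Default: report that the capability is missing.
    fn load_quantized(_packed: &[i32]) -> Result<usize, Unsupported> {
        Err(Unsupported(Self::NAME))
    }

    // Default: nothing to pre-grow.
    fn pregrow_scratch(_required: usize) {}
}

// A backend without the capability inherits both defaults.
pub struct NoQuantBackend;
impl Backend for NoQuantBackend {
    const NAME: &'static str = "no-quant";
}
impl QuantCapability for NoQuantBackend {}

// A backend with the capability overrides only what it needs.
pub struct TileBackend;
impl Backend for TileBackend {
    const NAME: &'static str = "tile";
}
impl QuantCapability for TileBackend {
    fn load_quantized(packed: &[i32]) -> Result<usize, Unsupported> {
        // Toy result: the number of 4-bit values held in the packed words.
        Ok(packed.len() * 8)
    }
}
```

Calling `NoQuantBackend::load_quantized(&[0])` yields the `Unsupported` error, while `TileBackend::load_quantized(&[0, 0])` succeeds. Generic code bounded on the capability trait compiles for both backends and finds out at call time whether the capability is present.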

Provided Methods§

fn load_gptq( _qweight: &[i32], _scales: &[f32], _qzeros: &[i32], _g_idx: Option<&[i32]>, _bias_host: Option<&[f32]>, _bits: u32, _group_size: usize, _k: usize, _n: usize, ) -> Result<Box<dyn Linear<Self> + Send + Sync>>

Repack raw GPTQ tensors into a backend-specific Linear<Self> impl. Called once per layer at model load time.

Inputs are host-side slices (CPU memory): the loader reads them from safetensors and hands them off, and each backend uploads and repacks them according to its own strategy. bits is typically 4 and group_size is typically 128. bias_host is an optional f32 slice of shape [out_features], present when the model has a fused bias (e.g. Qwen2.5 attention projections).

Phase 3e/2: returns Box<dyn Linear<Self>> directly (CUDA: CudaMarlinLinear, CPU: CpuGptqLinear). Kernel dispatch lives inside the boxed Linear’s forward; the old gemm_gptq trait method has been removed.
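
As a rough aid for callers preparing the host-side slices, the sketch below computes the slice lengths one would expect and unpacks a single packed word. It assumes the common AutoGPTQ-style layout (qweight packed along K as [K * bits / 32, N], scales as [K / group_size, N], qzeros packed along N as [K / group_size, N * bits / 32]); the page does not state the layout this crate expects, so treat the shapes as an assumption. `GptqLens`, `expected_lens`, and `unpack_word` are hypothetical helpers, not part of the crate.

```rust
/// Expected element counts for each input slice (assumed AutoGPTQ-style layout).
#[derive(Debug, PartialEq)]
pub struct GptqLens {
    pub qweight: usize,
    pub scales: usize,
    pub qzeros: usize,
}

/// Returns `None` when the dimensions cannot be packed evenly.
pub fn expected_lens(bits: u32, group_size: usize, k: usize, n: usize) -> Option<GptqLens> {
    let bits = bits as usize;
    if bits == 0 || bits >= 32 || 32 % bits != 0 || group_size == 0 {
        return None;
    }
    let per_word = 32 / bits; // quantized values per i32 (8 when bits == 4)
    if k % per_word != 0 || n % per_word != 0 || k % group_size != 0 {
        return None;
    }
    let groups = k / group_size;
    Some(GptqLens {
        qweight: (k / per_word) * n,     // packed along K
        scales: groups * n,              // one scale per (group, output column)
        qzeros: groups * (n / per_word), // packed along N
    })
}

/// Unpack one packed word into `32 / bits` unsigned values, lowest bits first.
pub fn unpack_word(word: i32, bits: u32) -> Vec<u32> {
    assert!(bits > 0 && bits < 32 && 32 % bits == 0, "unsupported bit width");
    let w = word as u32; // reinterpret the bit pattern, no sign extension
    let mask = (1u32 << bits) - 1;
    (0..32 / bits).map(|i| (w >> (i * bits)) & mask).collect()
}
```

With the typical settings quoted above (bits = 4, group_size = 128) and k = n = 4096, `expected_lens` gives 2,097,152 qweight words, 131,072 scales, and 16,384 qzeros words under this assumed layout.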

fn load_gptq_stacked( _qweights: &[&[i32]], _scales: &[&[f32]], _qzeros: &[&[i32]], _g_idx: Option<&[i32]>, _bits: u32, _group_size: usize, _k: usize, _n_per_expert: usize, ) -> Result<Arc<dyn MarlinExpertStack<Self>>>

Load num_experts GPTQ weight tiles into a single stacked store in which each expert’s packed bytes are contiguous. The offset GEMM relies on this layout to dispatch per expert by pointer offset alone.

Why this is a separate API from load_gptq + post-hoc concat: Marlin’s repack permutes data in [K-tile-row outer, N-tile inner] order. A single repack of concat(all experts along N) produces a buffer where expert e’s bytes are spread across K-tile-rows, NOT contiguous. Per-expert repack-then-concat keeps each expert’s data in one contiguous block.
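
The contiguity argument can be checked with a toy model that tracks only tile positions, not real Marlin data. The function names and the idea of numbering tiles linearly are illustrative assumptions; the model just applies the [K-tile-row outer, N-tile inner] order stated above to both strategies.

```rust
/// Linear tile positions owned by `expert` when all experts are concatenated
/// along N first and the result is repacked once in
/// [K-tile-row outer, N-tile inner] order.
pub fn concat_then_repack(
    expert: usize,
    num_experts: usize,
    k_tiles: usize,
    n_tiles_per_expert: usize,
) -> Vec<usize> {
    let n_tiles_total = num_experts * n_tiles_per_expert;
    let mut positions = Vec::with_capacity(k_tiles * n_tiles_per_expert);
    for kt in 0..k_tiles {
        for nt in 0..n_tiles_per_expert {
            // Each K-tile-row spans every expert's N tiles.
            positions.push(kt * n_tiles_total + expert * n_tiles_per_expert + nt);
        }
    }
    positions
}

/// Linear tile positions owned by `expert` when each expert is repacked on
/// its own and the repacked blocks are concatenated afterwards.
pub fn repack_then_concat(expert: usize, k_tiles: usize, n_tiles_per_expert: usize) -> Vec<usize> {
    let tiles_per_expert = k_tiles * n_tiles_per_expert;
    let base = expert * tiles_per_expert;
    (0..tiles_per_expert).map(|i| base + i).collect()
}

/// True when the positions form one unbroken run.
pub fn is_contiguous(positions: &[usize]) -> bool {
    positions.windows(2).all(|w| w[1] == w[0] + 1)
}
```

For two experts with two K tiles and two N tiles each, expert 1 lands at positions 2, 3, 6, 7 under concat-then-repack but at 4, 5, 6, 7 under repack-then-concat, so only the second layout can be addressed with a single base offset.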

qweights[i], scales[i], and qzeros[i] are expert i’s raw GPTQ tensors. All experts share the same K, group_size, bits, and g_idx.

The default returns Err(unsupported); backends with a per-expert MoE GPTQ path override it. Phase C step 4e: returns the trait-object MarlinExpertStack directly. Internally, each backend constructs its own opaque repacked tile (Marlin: per-expert-then-concat; CPU: dequantized f32 weight slab) and wraps it in the concrete {Cuda,Cpu}MarlinExpertStack impl.

Removing Self::GptqStore from the public API eliminates the type leak that previously forced ExpertStack<B> to carry Option<Arc<B::GptqStore>>. Adding a new Marlin backend now requires only implementing this method and a fresh MarlinExpertStack<NewBackend> impl, with no edits to the Backend trait.

fn pregrow_marlin_gather_scratch(_ctx: &mut Self::Context, _required: usize)

Pre-grow any backend-internal scratch slots whose size depends on m_total * intermediate_size (the largest matmul fan-in inside unified_forward_internal). The default is a no-op. CUDA implements this to grow the perm-aware Marlin gather scratch eagerly, before the caller enters a CUDA-graph capture region, because cuLaunchKernel after a runtime alloc inside a captured stream returns CUDA_ERROR_INVALID_VALUE.
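
The eager-growth idea can be modelled on the host without any CUDA calls. In this sketch a `Vec<f32>` stands in for the device scratch and a boolean flag stands in for an active graph capture; `Scratch` and all of its methods are hypothetical and do not mirror the crate's context type. It only shows why sizing the buffer before capture avoids an allocation at the point where allocation is not allowed.

```rust
/// Host-side stand-in for a backend scratch slot (illustrative only).
pub struct Scratch {
    buf: Vec<f32>,
    capturing: bool,
}

impl Scratch {
    pub fn new() -> Self {
        Scratch { buf: Vec::new(), capturing: false }
    }

    /// Grow the buffer up front; never shrinks.
    pub fn pregrow(&mut self, required: usize) {
        if self.buf.len() < required {
            self.buf.resize(required, 0.0);
        }
    }

    /// Mark the start of a region in which allocation is forbidden.
    pub fn begin_capture(&mut self) {
        self.capturing = true;
    }

    /// Mark the end of that region.
    pub fn end_capture(&mut self) {
        self.capturing = false;
    }

    /// Borrow `len` elements, growing lazily only outside a capture region.
    pub fn slice(&mut self, len: usize) -> Result<&mut [f32], &'static str> {
        if len > self.buf.len() {
            if self.capturing {
                return Err("scratch would have to grow inside a capture region");
            }
            self.buf.resize(len, 0.0);
        }
        Ok(&mut self.buf[..len])
    }
}
```

A caller that knows the worst-case size (here, the m_total * intermediate_size product) calls `pregrow` once, then `begin_capture`; every later `slice` request up to that size succeeds without touching the allocator, while a larger request fails loudly instead of allocating mid-capture.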

Dyn Compatibility§

This trait is not dyn compatible.

In older versions of Rust, dyn compatibility was called "object safety".

Implementors§