Concrete Linear<B> impls for quantized weights.
Phase 3e moves the per-backend kernel-dispatch logic out of the
BackendQuantMarlin / BackendQuantGguf trait method bodies and
into these concrete Linear<B> types. Each impl calls the
cudarc / metal / cpu kernel directly, without going through the
B::gemm_gptq indirection.
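As a rough illustration of the pattern, the sketch below shows a concrete CPU type that dequantizes 4-bit weights once at load time and then runs a plain f32 matmul in forward(). Everything in it is an assumption made for illustration: the Backend and Linear trait shapes, the CpuDequantLinear name, the from_packed constructor, and the simplified packing (two nibbles per byte, low nibble first, per-group scale and zero point) are hypothetical and do not reproduce this crate's actual signatures or the real GPTQ tensor layout.

```rust
/// Hypothetical backend marker trait (stand-in, not the crate's real trait).
pub trait Backend {}

/// Hypothetical CPU backend marker.
pub struct CpuBackend;
impl Backend for CpuBackend {}

/// Hypothetical shape of a `Linear<B>` trait: y = x * W^T, row-major.
pub trait Linear<B: Backend> {
    fn in_features(&self) -> usize;
    fn out_features(&self) -> usize;
    /// `x` holds `rows * in_features` values; returns `rows * out_features`.
    fn forward(&self, x: &[f32], rows: usize) -> Vec<f32>;
}

/// Concrete CPU impl: weights are expanded to f32 once, at load time.
pub struct CpuDequantLinear {
    weight: Vec<f32>, // [out_features, in_features], row-major
    in_features: usize,
    out_features: usize,
}

impl CpuDequantLinear {
    /// Assumed simplified layout (NOT the real GPTQ layout):
    /// - `qweight`: [out, in / 2] bytes, two 4-bit values per byte, low nibble first
    /// - `scales`, `zeros`: [out, in / group_size], one entry per group
    pub fn from_packed(
        qweight: &[u8],
        scales: &[f32],
        zeros: &[u8],
        in_features: usize,
        out_features: usize,
        group_size: usize,
    ) -> Self {
        assert!(group_size > 0 && in_features % group_size == 0);
        assert!(in_features % 2 == 0);
        let groups = in_features / group_size;
        assert_eq!(qweight.len(), out_features * in_features / 2);
        assert_eq!(scales.len(), out_features * groups);
        assert_eq!(zeros.len(), out_features * groups);

        let mut weight = Vec::with_capacity(out_features * in_features);
        for o in 0..out_features {
            for i in 0..in_features {
                let byte = qweight[(o * in_features + i) / 2];
                let q = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
                let g = o * groups + i / group_size;
                // w = (q - zero) * scale
                weight.push((q as f32 - zeros[g] as f32) * scales[g]);
            }
        }
        Self { weight, in_features, out_features }
    }
}

impl Linear<CpuBackend> for CpuDequantLinear {
    fn in_features(&self) -> usize {
        self.in_features
    }

    fn out_features(&self) -> usize {
        self.out_features
    }

    // The kernel call lives in the impl itself; there is no dispatch
    // through a backend trait method.
    fn forward(&self, x: &[f32], rows: usize) -> Vec<f32> {
        assert_eq!(x.len(), rows * self.in_features);
        let (n_in, n_out) = (self.in_features, self.out_features);
        let mut y = vec![0.0f32; rows * n_out];
        for r in 0..rows {
            let xr = &x[r * n_in..(r + 1) * n_in];
            for o in 0..n_out {
                let wr = &self.weight[o * n_in..(o + 1) * n_in];
                y[r * n_out + o] = xr.iter().zip(wr).map(|(a, b)| a * b).sum::<f32>();
            }
        }
        y
    }
}
```

In this sketch the type parameter on Linear selects the backend at compile time, so a CUDA or Metal variant would be a separate struct with its own forward() body rather than another arm inside a shared trait method.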
Why these live in ferrum-kernels rather than ferrum-quantization:
the forward() body needs cudarc / metal-rs types, and pulling
those into ferrum-quantization would create a dep cycle (kernels
→ quantization → kernels). ferrum-quantization stays as the
weight-format parser layer; backend-specific Linear impls live here.
Modules
- cpu_dequant: Linear<CpuBackend> impl for GPTQ weights, dequantized at load time.
- cpu_gguf: Linear<CpuBackend> impl for GGUF k-quant weights.
- cpu_marlin_stack: MarlinExpertStack<CpuBackend> impl on top of CPU's dequant-on-load GptqStore. Facade: delegates to the existing BackendQuantMarlin::moe_gemm_phase_* (default trait impl that loops calling gemm_gptq_with_offset_strided on CPU) and make_stacked_expert_linear methods.