Module quant_linear

Concrete Linear<B> impls for quantized weights.

Phase 3e moves the per-backend kernel-dispatch logic out of the BackendQuantMarlin / BackendQuantGguf trait method bodies and into these concrete Linear<B> types. Each impl owns its cudarc / metal / cpu kernel call directly, with no B::gemm_gptq indirection.

These impls live in ferrum-kernels rather than ferrum-quantization because the forward() body needs cudarc / metal-rs types, and pulling those into ferrum-quantization would create a dependency cycle (kernels → quantization → kernels). ferrum-quantization stays the weight-format parser layer; backend-specific Linear impls live here.
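To illustrate the pattern of a concrete type that owns its kernel call, here is a minimal sketch. Every name in it (Backend, CpuBackend, Linear, DenseCpuLinear) is a hypothetical stand-in written for this example, not the crate's actual definitions, and the naive loop stands in for a real kernel.

```rust
/// Hypothetical backend marker trait (stand-in for the crate's real one).
pub trait Backend {}

/// Hypothetical CPU backend marker type.
pub struct CpuBackend;
impl Backend for CpuBackend {}

/// Assumed shape of a `Linear<B>` trait computing y = x * W^T.
pub trait Linear<B: Backend> {
    fn in_features(&self) -> usize;
    fn out_features(&self) -> usize;
    /// `x` is row-major `[rows, in_features]`; returns `[rows, out_features]`.
    fn forward(&self, x: &[f32], rows: usize) -> Vec<f32>;
}

/// Concrete impl: holds its weights and runs its own kernel in `forward`,
/// instead of routing through a method on the backend trait.
pub struct DenseCpuLinear {
    weight: Vec<f32>, // row-major [out_features, in_features]
    in_features: usize,
    out_features: usize,
}

impl DenseCpuLinear {
    pub fn new(weight: Vec<f32>, in_features: usize, out_features: usize) -> Self {
        assert_eq!(weight.len(), in_features * out_features);
        Self { weight, in_features, out_features }
    }
}

impl Linear<CpuBackend> for DenseCpuLinear {
    fn in_features(&self) -> usize {
        self.in_features
    }

    fn out_features(&self) -> usize {
        self.out_features
    }

    fn forward(&self, x: &[f32], rows: usize) -> Vec<f32> {
        assert_eq!(x.len(), rows * self.in_features);
        let mut y = vec![0.0f32; rows * self.out_features];
        for r in 0..rows {
            let xr = &x[r * self.in_features..(r + 1) * self.in_features];
            for o in 0..self.out_features {
                let wr = &self.weight[o * self.in_features..(o + 1) * self.in_features];
                // The "kernel": a plain dot product of one input row and one weight row.
                y[r * self.out_features + o] = xr.iter().zip(wr).map(|(a, b)| a * b).sum();
            }
        }
        y
    }
}
```

Under these assumptions, a GPU variant would be a separate struct implementing Linear for its own backend marker and holding device buffers, so the backend-specific types stay in the crate that already depends on them.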

Modules

cpu_dequant
Linear<CpuBackend> impl for GPTQ weights, dequantized at load time.
cpu_gguf
Linear<CpuBackend> impl for GGUF k-quant weights.
cpu_marlin_stack
MarlinExpertStack<CpuBackend> impl on top of CPU’s dequant-on-load GptqStore. It is a facade: it delegates to the existing BackendQuantMarlin::moe_gemm_phase_* methods (the default trait impl, which on CPU loops over calls to gemm_gptq_with_offset_strided) and to make_stacked_expert_linear.
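The dequant-on-load idea behind cpu_dequant and cpu_marlin_stack can be sketched as follows. This is a simplified illustration, not the crate's code and not the real GPTQ tensor layout: the function name dequant_4bit, the packing order (eight 4-bit values per u32, lowest nibble first), and the one scale and zero point per group of consecutive values are all assumptions made for the example.

```rust
/// Hypothetical helper: expand packed 4-bit weights to f32 once, at load time.
///
/// Assumed layout (simplified for illustration):
/// - each `u32` holds eight 4-bit values, lowest nibble first;
/// - value `k` belongs to group `k / group_size`;
/// - each group has one `scale` and one integer `zero` point.
pub fn dequant_4bit(
    packed: &[u32],
    scales: &[f32],
    zeros: &[u8],
    group_size: usize,
) -> Vec<f32> {
    assert!(group_size > 0);
    let n = packed.len() * 8;
    assert_eq!(scales.len(), zeros.len());
    assert!(scales.len() * group_size >= n);

    let mut out = Vec::with_capacity(n);
    for (i, &word) in packed.iter().enumerate() {
        for j in 0..8usize {
            // Extract the j-th nibble of this word.
            let q = ((word >> (4 * j)) & 0xF) as i32;
            let g = (i * 8 + j) / group_size;
            // w = (q - zero) * scale
            out.push((q - zeros[g] as i32) as f32 * scales[g]);
        }
    }
    out
}
```

With weights expanded this way at load time, the forward pass can run an ordinary f32 matrix multiply, trading extra memory for a simpler CPU path.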