
Module linear


GgufLinear<B>: a GGUF-sourced linear projection that integrates with ferrum’s Linear<B> trait.

Phase 1B uses an eager-dequant-at-load strategy: when constructed from a candle QTensor, the quantized payload is decoded to fp32 once on CPU, then handed to DenseLinear<B> so the runtime path goes through the standard B::gemm kernel. This is the simplest correct path that works uniformly across CPU / Metal / CUDA without per-backend bridging code.
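The eager-dequant-at-load flow can be illustrated with a minimal, std-only sketch. Everything in it is hypothetical: `ToyQTensor` is a made-up block-scaled int8 format (not a GGUF quantization type and not candle's QTensor), and `ToyDenseLinear` merely stands in for DenseLinear<B>, with a naive matrix-vector product in place of B::gemm. It only shows the shape of the idea: decode once at construction, then run a plain dense kernel on every call.

```rust
/// Hypothetical stand-in for a quantized tensor: one f32 scale per block of
/// i8 values. This is NOT a GGUF format and NOT candle's QTensor.
struct ToyQTensor {
    rows: usize,      // out_features
    cols: usize,      // in_features
    block: usize,     // number of consecutive weights sharing one scale
    scales: Vec<f32>, // one scale per block
    quants: Vec<i8>,  // rows * cols quantized values, row-major
}

impl ToyQTensor {
    /// Decode the whole payload to f32 in one pass (the "eager" step).
    fn dequantize(&self) -> Vec<f32> {
        self.quants
            .iter()
            .enumerate()
            .map(|(i, &q)| self.scales[i / self.block] * q as f32)
            .collect()
    }
}

/// Dense layer holding plain f32 weights, standing in for DenseLinear<B>.
struct ToyDenseLinear {
    rows: usize,
    cols: usize,
    weight: Vec<f32>,
}

impl ToyDenseLinear {
    /// Load-time constructor: dequantize once and keep only the f32 weights.
    fn from_qtensor(q: &ToyQTensor) -> Self {
        Self { rows: q.rows, cols: q.cols, weight: q.dequantize() }
    }

    /// y = W x, a naive matrix-vector product standing in for B::gemm.
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols, "input length must equal in_features");
        (0..self.rows)
            .map(|r| {
                let row = &self.weight[r * self.cols..(r + 1) * self.cols];
                row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>()
            })
            .collect()
    }
}
```

After `from_qtensor` returns, the runtime path never touches the quantized data again, which is why the same dense kernel can serve every backend in this strategy.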

Trade-off: we lose GGUF’s memory advantage (Q4_K_M at about 4.5 bits/weight becomes fp32 at 32 bits/weight in RAM, roughly a 7x increase) and we don’t get the performance of fused dequant-matmul kernels. Phase 1D will replace this with a quantization-aware Linear that holds the QTensor and dispatches to Metal / CUDA Q4_K_M kernels.
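To make the memory cost concrete, here is a small arithmetic sketch. It assumes the 4.5 bits/weight figure refers to the Q4_K super-block layout (256 weights packed into 144 bytes); a Q4_K_M file mixes tensor types, so its real average may differ somewhat. The function names are illustrative and not part of this module.

```rust
/// Assumed Q4_K super-block layout: 256 weights stored in 144 bytes,
/// i.e. 144 * 8 / 256 = 4.5 bits per weight.
const Q4K_BLOCK_WEIGHTS: u64 = 256;
const Q4K_BLOCK_BYTES: u64 = 144;

/// Bytes needed to hold `n_weights` in Q4_K, rounded up to whole blocks.
fn q4k_bytes(n_weights: u64) -> u64 {
    let blocks = (n_weights + Q4K_BLOCK_WEIGHTS - 1) / Q4K_BLOCK_WEIGHTS;
    blocks * Q4K_BLOCK_BYTES
}

/// Bytes needed for the same weights after eager dequantization to fp32.
fn f32_bytes(n_weights: u64) -> u64 {
    n_weights * 4
}
```

For a single 4096 x 4096 projection, `q4k_bytes` gives 9 MiB while `f32_bytes` gives 64 MiB, about 7.1 times more resident memory for that one layer.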

Why a dedicated GgufLinear<B> type instead of just returning DenseLinear<B>? So that Phase 1D can swap the internals (eager dequant → lazy QMatMul) without changing the public API of any WeightLoader that already returns Box<dyn Linear<B>>.
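The API-stability argument can be sketched as a wrapper type with a private representation. This is a simplified illustration under stated assumptions, not ferrum's actual code: the `Linear` trait here drops the backend parameter B, and `Dense`, `Inner`, `GgufLinearSketch`, and `boxed_linear` are all hypothetical names.

```rust
/// Simplified, hypothetical trait; the real Linear<B> is generic over a backend.
trait Linear {
    fn forward(&self, x: &[f32]) -> Vec<f32>;
}

/// Plain f32 layer, standing in for DenseLinear<B>.
struct Dense {
    rows: usize,
    cols: usize,
    weight: Vec<f32>, // row-major, rows * cols
}

impl Linear for Dense {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols, "input length must equal in_features");
        (0..self.rows)
            .map(|r| {
                let row = &self.weight[r * self.cols..(r + 1) * self.cols];
                row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>()
            })
            .collect()
    }
}

/// Private representation. A later phase could add a variant that keeps the
/// quantized payload; callers of the wrapper would not need to change.
enum Inner {
    Eager(Dense),
}

/// Wrapper type that callers name; its internals stay hidden behind `Inner`.
struct GgufLinearSketch {
    inner: Inner,
}

impl GgufLinearSketch {
    /// Hypothetical constructor taking weights that were already dequantized.
    fn from_f32(rows: usize, cols: usize, weight: Vec<f32>) -> Self {
        assert_eq!(weight.len(), rows * cols, "weight must be rows * cols");
        Self { inner: Inner::Eager(Dense { rows, cols, weight }) }
    }
}

impl Linear for GgufLinearSketch {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        match &self.inner {
            Inner::Eager(dense) => dense.forward(x),
        }
    }
}

/// Plays the role of a convenience constructor returning a trait object,
/// so loaders can hand out a uniform `Box<dyn Linear>`.
fn boxed_linear(rows: usize, cols: usize, weight: Vec<f32>) -> Box<dyn Linear> {
    Box::new(GgufLinearSketch::from_f32(rows, cols, weight))
}
```

Because a loader in this sketch only ever exposes `Box<dyn Linear>`, adding a second `Inner` variant later changes one `match` arm and one constructor, not any call site.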

Structs

GgufLinear
Linear projection backed by a GGUF-sourced quantized tensor.

Functions

linear_from_qtensor
Convenience: build a boxed Linear<B> from a QTensor. Useful for WeightLoader impls that want a uniform Box<dyn Linear<B>> output.