GgufLinear<B>: a GGUF-sourced linear projection that integrates with
ferrum’s Linear<B> trait.
Phase 1B uses an eager-dequant-at-load strategy: when constructed from
a candle QTensor, the quantized payload is decoded to fp32 once on CPU,
then handed to DenseLinear<B> so the runtime path goes through the
standard B::gemm kernel. This is the simplest correct path that works
uniformly across CPU / Metal / CUDA without per-backend bridging code.
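As a rough illustration of the decode-once-at-load idea, the sketch below uses a made-up 4-bit block format (`ToyBlock`) and a stand-in dense layer (`ToyDenseLinear`). Both names are hypothetical: this is not the real Q4_K layout, not candle's `QTensor` API, and not ferrum's `DenseLinear<B>`. It only shows the shape of the strategy, where weights are decoded to f32 a single time in the constructor and `forward` never sees quantized data.

```rust
/// Hypothetical 4-bit block: one f32 scale plus 32 weights packed two per byte.
/// This is NOT the real Q4_K layout; it only illustrates decode-once-at-load.
pub struct ToyBlock {
    pub scale: f32,
    pub packed: [u8; 16],
}

/// Decode every block to f32 once (the "eager dequant" step).
pub fn dequantize(blocks: &[ToyBlock]) -> Vec<f32> {
    let mut out = Vec::with_capacity(blocks.len() * 32);
    for b in blocks {
        for &byte in b.packed.iter() {
            // Low nibble first, then high nibble; stored values are offset by 8.
            let lo = (byte & 0x0F) as i32 - 8;
            let hi = (byte >> 4) as i32 - 8;
            out.push(lo as f32 * b.scale);
            out.push(hi as f32 * b.scale);
        }
    }
    out
}

/// Stand-in for the dense path: y = W x, with W stored row-major as f32.
pub struct ToyDenseLinear {
    weight: Vec<f32>,
    in_features: usize,
}

impl ToyDenseLinear {
    /// Dequantize at construction time; nothing quantized is kept afterwards.
    pub fn from_blocks(blocks: &[ToyBlock], out_features: usize, in_features: usize) -> Self {
        let weight = dequantize(blocks);
        assert_eq!(weight.len(), out_features * in_features);
        Self { weight, in_features }
    }

    /// Plain f32 matrix-vector product, one output per weight row.
    pub fn forward(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.in_features);
        self.weight
            .chunks(self.in_features)
            .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
            .collect()
    }
}
```

Because the quantized blocks are dropped after construction, the runtime path is identical to that of any other dense layer, which is what makes this approach backend-agnostic.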
Trade-off: we lose GGUF’s memory advantage (Q4_K_M at roughly 4.5 bits/weight becomes fp32 at 32 bits/weight in RAM, about a 7x increase), and we don’t get the performance of a fused dequant-matmul kernel. Phase 1D will replace this with a real quantization-aware Linear that holds the QTensor and dispatches to Metal / CUDA Q4_K_M kernels.
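To make the memory cost concrete, here is a small back-of-the-envelope helper. The function names and the 4096 x 4096 projection size are illustrative assumptions, not part of this crate; the 4.5 bits/weight figure is the one quoted above.

```rust
/// Bytes needed to store `n_weights` values at `bits_per_weight` bits each.
pub fn weight_bytes(n_weights: u64, bits_per_weight: f64) -> f64 {
    n_weights as f64 * bits_per_weight / 8.0
}

/// Factor by which eager dequantization to fp32 inflates weight storage.
pub fn fp32_blowup(quant_bits_per_weight: f64) -> f64 {
    32.0 / quant_bits_per_weight
}
```

For a hypothetical 4096 x 4096 projection (16,777,216 weights), `weight_bytes` gives 9,437,184 bytes (9 MiB) at 4.5 bits/weight and 67,108,864 bytes (64 MiB) at 32 bits/weight, so `fp32_blowup(4.5)` is roughly 7.1.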
Why a dedicated GgufLinear<B> type instead of just returning
DenseLinear<B>? So Phase 1D can swap the internals (eager dequant →
lazy QMatMul) without churning the public API of any WeightLoader
that already returns Box<dyn Linear<B>>.
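The API-stability argument can be sketched with a simplified, non-generic trait. Everything here (`ToyLinear`, `ToyDense`, `ToyGgufLinear`, `boxed_linear`) is hypothetical and stands in for the real `Linear<B>`, `DenseLinear<B>`, and `GgufLinear<B>`, whose signatures are not shown in this page. The point is only that callers hold a trait object, so the wrapper's private field can change later without affecting them.

```rust
/// Hypothetical stand-in for a linear-layer trait; not ferrum's actual `Linear<B>`.
pub trait ToyLinear {
    fn forward(&self, x: &[f32]) -> Vec<f32>;
}

/// Dense row-major f32 weights.
pub struct ToyDense {
    weight: Vec<f32>,
    in_features: usize,
}

impl ToyLinear for ToyDense {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.in_features);
        self.weight
            .chunks(self.in_features)
            .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
            .collect()
    }
}

/// Wrapper type exposed to loaders. Today it delegates to a dense layer;
/// the private `inner` field could later hold a quantized representation
/// without changing any code that only sees `Box<dyn ToyLinear>`.
pub struct ToyGgufLinear {
    inner: ToyDense,
}

impl ToyLinear for ToyGgufLinear {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        self.inner.forward(x)
    }
}

/// Loader-facing constructor returning a trait object.
pub fn boxed_linear(weight: Vec<f32>, in_features: usize) -> Box<dyn ToyLinear> {
    Box::new(ToyGgufLinear {
        inner: ToyDense { weight, in_features },
    })
}
```

A caller would write `let layer = boxed_linear(weights, in_features);` and use `layer.forward(&x)`, never naming the concrete type.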
Structs§
- GgufLinear - Linear projection backed by a GGUF-sourced quantized tensor.
Functions§
- linear_from_qtensor - Convenience: build a boxed Linear<B> from a QTensor. Useful for WeightLoader impls that want a uniform Box<dyn Linear<B>> output.