Crate ferrum_quantization

Weight-format abstraction for Ferrum models.

Separates “what is the weight matrix like” (dense f32, GPTQ int4, AWQ, GGUF, …) from “what device does the math” (Backend) and “how does the model wire things together” (model code).

Usage in model code:

let qkv: Box<dyn Linear<B>> = loader.load_linear("model.layers.0.self_attn.qkv_proj")?;
qkv.forward(ctx, &input, &mut out, m);

The Linear trait dispatches to the appropriate backend kernel (B::gemm for Dense, B::gemm_gptq for GPTQ, etc.) without the model having to branch on quantization type.
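The dispatch described above can be illustrated with a small, self-contained sketch. Everything in it is a hypothetical stand-in rather than the crate's actual API: the `Backend` and `Linear` traits are simplified (no `ctx` argument, plain slices instead of backend buffers), `gemm_q8` is a toy int8 kernel standing in for something like `B::gemm_gptq`, and `ToyDense`, `ToyQ8`, `load_linear`, and `run_example` are invented for illustration.

```rust
// Hypothetical sketch: these are stand-ins, not the real ferrum types.

/// "What device does the math": one kernel per weight format.
trait Backend {
    /// out[m x n] = input[m x k] * weight[n x k]^T, with dense f32 weights.
    fn gemm(input: &[f32], weight: &[f32], out: &mut [f32], m: usize, n: usize, k: usize);
    /// Same product, but weights are i8 values sharing one f32 scale (toy format).
    fn gemm_q8(input: &[f32], weight: &[i8], scale: f32, out: &mut [f32], m: usize, n: usize, k: usize);
}

struct Cpu;

impl Backend for Cpu {
    fn gemm(input: &[f32], weight: &[f32], out: &mut [f32], m: usize, n: usize, k: usize) {
        for i in 0..m {
            for j in 0..n {
                let mut acc = 0.0f32;
                for p in 0..k {
                    acc += input[i * k + p] * weight[j * k + p];
                }
                out[i * n + j] = acc;
            }
        }
    }

    fn gemm_q8(input: &[f32], weight: &[i8], scale: f32, out: &mut [f32], m: usize, n: usize, k: usize) {
        for i in 0..m {
            for j in 0..n {
                let mut acc = 0.0f32;
                for p in 0..k {
                    // Dequantize each weight on the fly.
                    acc += input[i * k + p] * (weight[j * k + p] as f32 * scale);
                }
                out[i * n + j] = acc;
            }
        }
    }
}

/// "What is the weight matrix like": each format picks its own kernel.
trait Linear<B: Backend> {
    fn forward(&self, input: &[f32], out: &mut [f32], m: usize);
}

struct ToyDense { weight: Vec<f32>, n: usize, k: usize }
struct ToyQ8 { weight: Vec<i8>, scale: f32, n: usize, k: usize }

impl<B: Backend> Linear<B> for ToyDense {
    fn forward(&self, input: &[f32], out: &mut [f32], m: usize) {
        B::gemm(input, &self.weight, out, m, self.n, self.k);
    }
}

impl<B: Backend> Linear<B> for ToyQ8 {
    fn forward(&self, input: &[f32], out: &mut [f32], m: usize) {
        B::gemm_q8(input, &self.weight, self.scale, out, m, self.n, self.k);
    }
}

/// Hypothetical loader: returns a boxed layer so the caller never sees the format.
fn load_linear(quantized: bool) -> Box<dyn Linear<Cpu>> {
    if quantized {
        Box::new(ToyQ8 { weight: vec![2, 4, 6, 8], scale: 0.5, n: 2, k: 2 })
    } else {
        Box::new(ToyDense { weight: vec![1.0, 2.0, 3.0, 4.0], n: 2, k: 2 })
    }
}

/// "Model code": the same call works for both formats, with no branching on type.
fn run_example() -> (Vec<f32>, Vec<f32>) {
    let input = [1.0f32, 1.0];
    let mut dense_out = vec![0.0f32; 2];
    let mut quant_out = vec![0.0f32; 2];
    load_linear(false).forward(&input, &mut dense_out, 1);
    load_linear(true).forward(&input, &mut quant_out, 1);
    (dense_out, quant_out)
}
```

Calling `run_example()` from a `main` function exercises both paths. The point of the design is that adding a new weight format only requires a new `Linear` implementation and a kernel on the backend; the model-side call site stays unchanged.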

Re-exports§

pub use dense::DenseLinear;
pub use gguf::GgufFile;
pub use gguf::GgufLinear;
pub use gguf::GgufLoader;
pub use gptq::GptqLinear;
pub use gptq::StackedExpertLinear;
pub use loader::PrefixedLoader;
pub use loader::WeightLoader;
pub use lora::LoraLinearRef;
pub use native_safetensors::NativeSafetensorsLoader;
pub use quant_linear::QuantLinear;
pub use config::QuantConfig;
pub use config::QuantMethod;

Modules§

config: Quantization configuration parsed from model metadata.
dense: Dense linear projection — the baseline, uses B::gemm directly.
gguf: GGUF (GGML Universal Format) reader.
gptq: GPTQ linear projection — thin factory wrapper.
loader: WeightLoader trait — unified interface for loading tensor/linear weights into a specific backend.
lora: LoRA reference utilities.
native_safetensors: Native safetensors WeightLoader<B> — mmap + safetensors crate, no candle dependency on the LLM hot path.
quant_linear: QuantLinear<B> — thin wrapper that delegates to the boxed Linear<B> returned by B::load_quant / B::load_quant_fused.
traits: Re-export of Linear trait (canonical home: ferrum-kernels).
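As a rough illustration of how a prefix-scoped loader might wrap another loader, the sketch below composes tensor names before delegating. This is an assumption based only on the name `PrefixedLoader`; the trait `TensorSource`, the types `MapSource` and `Prefixed`, the helper `example_source`, and the tensor name ending in `.weight` are all hypothetical and do not reflect the crate's real `WeightLoader` signatures.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a weight-loading interface (not the real WeightLoader).
trait TensorSource {
    fn load_tensor(&self, name: &str) -> Result<Vec<f32>, String>;
}

/// In-memory source keyed by fully qualified tensor name.
struct MapSource {
    tensors: HashMap<String, Vec<f32>>,
}

impl TensorSource for MapSource {
    fn load_tensor(&self, name: &str) -> Result<Vec<f32>, String> {
        self.tensors
            .get(name)
            .cloned()
            .ok_or_else(|| format!("missing tensor: {name}"))
    }
}

/// Wraps another source and prepends `prefix.` to every requested name,
/// so per-layer code can use short, layer-relative names.
struct Prefixed<'a, S: TensorSource> {
    inner: &'a S,
    prefix: String,
}

impl<'a, S: TensorSource> TensorSource for Prefixed<'a, S> {
    fn load_tensor(&self, name: &str) -> Result<Vec<f32>, String> {
        self.inner.load_tensor(&format!("{}.{}", self.prefix, name))
    }
}

/// Builds a tiny source holding one tensor under an illustrative name.
fn example_source() -> MapSource {
    let mut tensors = HashMap::new();
    tensors.insert(
        "model.layers.0.self_attn.qkv_proj.weight".to_string(),
        vec![1.0, 2.0],
    );
    MapSource { tensors }
}
```

With this shape, layer-building code can be handed a `Prefixed` value scoped to `model.layers.0` and request `self_attn.qkv_proj.weight` without knowing where the layer sits in the model.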

Traits§

Linear: A weight-bearing linear projection.
