Convert tensors from external formats (safetensors, ONNX) into GGUF with per-tensor quantization. Designed to be called at first inference load: read the source file once, write a GGUF blob with a chosen quant scheme, then on subsequent loads dequant the GGUF directly — cutting both disk footprint and memory at load time for transformer weights (often ≥4× shrink at Q4_K_M).
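The first-load flow described above amounts to a convert-once cache. The following is a minimal sketch of that pattern, not this crate's API: `ensure_gguf`, the `convert` closure, and the convention of caching next to the source with a `.gguf` extension are all assumptions made for illustration. In practice the closure would wrap a `Converter` chain like the one in the quick start.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of this crate): return the path of a cached
/// GGUF file, running `convert` only when the cache file does not exist yet.
pub fn ensure_gguf<F>(source: &Path, convert: F) -> io::Result<PathBuf>
where
    F: FnOnce(&Path, &Path) -> io::Result<()>,
{
    // Assumed naming convention: same directory and stem, `.gguf` extension.
    let cached = source.with_extension("gguf");
    if !cached.exists() {
        // First load: read the source once and write the quantized blob.
        convert(source, &cached)?;
    }
    // Subsequent loads skip conversion and use the GGUF file directly.
    Ok(cached)
}
```

A more defensive version would probably write to a temporary file and rename it into place, so that an interrupted conversion cannot leave a truncated cache file behind.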
§Quick start
use rlx_gguf_convert::{Converter, Scheme};

let report = Converter::from_safetensors("model.safetensors")?
    .default_scheme(Scheme::Q4_K)
    .skip_quant_for(|name, shape| {
        // Tiny 1-D tensors (norms, biases) stay full-precision.
        name.contains("norm") || name.contains("bias") || shape.len() < 2
    })
    .architecture("llama")
    .write_gguf("model.q4_k.gguf")?;
println!("wrote {} tensors, {:.2}× smaller",
    report.tensors,
    report.compression_ratio());
§Real-weight benchmarks
Validated end-to-end against two production checkpoints on an M2 mini,
release build. Mean cosine is the cosine similarity between the
dequantized Converter::write_gguf output and the source values,
averaged over every quantized weight tensor; non-quantized tensors
round-trip exactly and are excluded.
| Model | Source size | Scheme | Output | Shrink | Mean cosine | Wall |
|---|---|---|---|---|---|---|
| Bio_ClinicalBERT | 416 MB F32 | Q8_0 | 113 MB | 3.75× | 0.999984 | 0.27s |
| Bio_ClinicalBERT | 416 MB F32 | Q6_K | 86 MB | 4.85× | 0.999815 | 0.22s |
| Bio_ClinicalBERT | 416 MB F32 | Q4_K | 59 MB | 7.05× | 0.996785 | 0.44s |
| Bio_ClinicalBERT | 416 MB F32 | Q4_0 | 59 MB | 7.05× | 0.996169 | 0.44s |
| Qwen3-TTS 0.6B | 1.7 GB BF16 | Q4_K | 491 MB | 3.55× | 0.996712 | 3.7s |
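The mean cosine column can be reproduced along the following lines. This is a sketch under assumptions rather than the benchmark's actual harness: `cosine` and `mean_cosine` are hypothetical helpers, and the unweighted per-tensor average (every tensor counts equally, regardless of size) is one reading of "average over every quantized weight tensor".

```rust
/// Cosine similarity between two equal-length slices, accumulated in f64.
/// Returns None for mismatched lengths or when either input has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// Unweighted mean of per-tensor cosines over (source, dequantized) pairs.
/// Pairs whose cosine is undefined are skipped (an assumption of this sketch).
pub fn mean_cosine(pairs: &[(&[f32], &[f32])]) -> Option<f64> {
    let scores: Vec<f64> = pairs
        .iter()
        .filter_map(|(src, deq)| cosine(src, deq))
        .collect();
    if scores.is_empty() {
        None
    } else {
        Some(scores.iter().sum::<f64>() / scores.len() as f64)
    }
}
```

Accumulating in f64 keeps rounding error in the metric itself well below the differences between schemes in the table, which show up around the fourth decimal place.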
ConvertReport::compression_ratio reports source-byte shrink
(BF16 inputs naturally compress less than F32 inputs because
they’re already 2× smaller on disk).
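The shrink figures follow from bits per weight. The sketch below computes the ideal ratio from the standard ggml block layouts (weights per block, bytes per block); `block_layout` and `ideal_shrink` are hypothetical helpers for illustration, not functions exported by this crate, and the string scheme names stand in for the real `Scheme` variants.

```rust
/// (weights per block, bytes per block) in the standard ggml layouts for the
/// schemes that appear in the benchmark table.
pub fn block_layout(scheme: &str) -> Option<(u32, u32)> {
    match scheme {
        "Q4_0" => Some((32, 18)),
        "Q8_0" => Some((32, 34)),
        "Q4_K" => Some((256, 144)),
        "Q6_K" => Some((256, 210)),
        _ => None,
    }
}

/// Ideal shrink if every weight were quantized: source bits per weight
/// divided by quantized bits per weight (32 for F32, 16 for BF16/F16).
pub fn ideal_shrink(source_bits: u32, scheme: &str) -> Option<f64> {
    let (weights, bytes) = block_layout(scheme)?;
    let quant_bits = (bytes * 8) as f64 / weights as f64;
    Some(source_bits as f64 / quant_bits)
}
```

Under these layouts Q4_K costs 4.5 bits per weight, so `ideal_shrink(32, "Q4_K")` is about 7.11 and `ideal_shrink(16, "Q4_K")` is about 3.56, close to the measured 7.05× and 3.55×. Measured ratios sit slightly below the ideal, plausibly because skipped tensors stay at full precision and the output carries metadata.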
§Per-tensor schemes
Three priority levels, applied in order:
- Exact-name override: Converter::scheme_for_name.
- Predicate override: Converter::scheme_for, returning Some(scheme) to override or None to fall through.
- Default: Converter::default_scheme.
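The three levels above can be modelled as a small lookup. This is a sketch of the resolution order only: `Quant`, `Predicate`, and `Rules` are hypothetical stand-ins, not the crate's types, and both the `(name, shape)` predicate signature and the first-registered-wins ordering among predicates are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the crate's `Scheme` enum.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Quant {
    F16,
    Q4K,
    Q6K,
    Q8,
}

/// Assumed predicate shape: Some(q) overrides, None falls through.
pub type Predicate = Box<dyn Fn(&str, &[usize]) -> Option<Quant>>;

pub struct Rules {
    pub exact: HashMap<String, Quant>,
    pub predicates: Vec<Predicate>,
    pub default: Quant,
}

impl Rules {
    pub fn resolve(&self, name: &str, shape: &[usize]) -> Quant {
        // 1. An exact-name override wins outright.
        if let Some(&q) = self.exact.get(name) {
            return q;
        }
        // 2. The first predicate that returns Some(q) decides.
        if let Some(q) = self.predicates.iter().find_map(|p| p(name, shape)) {
            return q;
        }
        // 3. Otherwise use the default scheme.
        self.default
    }
}
```

With an exact rule for `lm_head.weight`, a predicate matching names that contain `attn`, and a `Q4K` default, a tensor named `lm_head.weight` takes the exact rule even if a predicate would also match it, and anything unmatched gets the default.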
Tensors whose element count isn’t a multiple of the chosen scheme’s
block size fall back to F16. Tensors matched by
Converter::skip_quant_for stay at their source dtype
(preserved via Scheme::F32 / Scheme::F16 / Scheme::BF16).
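The block-size fallback can be illustrated as follows. This is a sketch, assuming the check is made on the total element count as the text states; `block_size` and `effective_scheme` are hypothetical helpers, the string names stand in for `Scheme` variants, and the block sizes are those of the standard ggml layouts.

```rust
/// Weights per block in the standard ggml layouts; None means the format is
/// not block-quantized (F32, F16, BF16).
pub fn block_size(scheme: &str) -> Option<usize> {
    match scheme {
        "Q4_0" | "Q8_0" => Some(32),
        "Q4_K" | "Q6_K" => Some(256),
        _ => None,
    }
}

/// Scheme actually used for a tensor: the requested one when the element
/// count is a whole number of blocks, otherwise the F16 fallback.
pub fn effective_scheme<'a>(requested: &'a str, shape: &[usize]) -> &'a str {
    let elements: usize = shape.iter().product();
    match block_size(requested) {
        Some(block) if elements % block != 0 => "F16",
        _ => requested,
    }
}
```

For example, a `[100, 3]` tensor has 300 elements, which is not a multiple of 256, so a Q4_K request would fall back to F16, while a `[100, 8]` tensor (800 elements, 25 blocks of 32) can be stored as Q8_0.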
§Crate layout
- Scheme / Converter / ConvertReport are the public API.
- Source readers are gated behind features:
  - safetensors (default): .safetensors files.
  - onnx: ONNX initializer tensors via rlx-onnx-import.
- The encoder side is shared with rlx_gguf, so output round-trips through rlx_gguf::GgufFile::dequant_f32.
Structs§
- ConvertReport: Conversion summary returned by Converter::write_gguf. Use it to log compression ratios, generate a per-scheme histogram, or drive a re-convert pass with different scheme rules.
- Converter: Top-level conversion driver. Build with Converter::from_reader (or the from_safetensors / from_onnx convenience constructors behind their feature gates), set a default + per-tensor scheme, then Converter::write_gguf.
- NamedTensor
Enums§
- GgmlType
- MetaValue
- Scheme: Quantization scheme to apply to a tensor when converting. Mirrors the GgmlType variants we have encoders for.
Traits§
- TensorReader: Source-file reader contract. Implementations live in this module behind cargo features; downstream crates can plug in their own.