Crate rlx_gguf_convert

Convert tensors from external formats (safetensors, ONNX) into GGUF with per-tensor quantization. Designed to be called at first inference load: read the source file once, write a GGUF blob with a chosen quantization scheme, then on subsequent loads dequantize the GGUF directly. This cuts both disk footprint and load-time memory for transformer weights (in the benchmarks below, Q4_K shrinks F32 checkpoints about 7× and BF16 checkpoints about 3.5×).
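The convert-once, reuse-afterwards flow described above can be sketched with the standard library alone. This is a minimal illustration, not part of the crate: `ensure_gguf` and its `convert` callback are hypothetical names, and the callback stands in for whatever performs the real conversion (for example a `Converter` pipeline ending in `write_gguf`).

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Returns the path of a cached GGUF file, converting on the first call only.
///
/// `convert` is a hypothetical callback (source path, destination path)
/// standing in for the real conversion step.
pub fn ensure_gguf<F>(source: &Path, convert: F) -> io::Result<PathBuf>
where
    F: FnOnce(&Path, &Path) -> io::Result<()>,
{
    // Cache next to the source: model.safetensors -> model.gguf
    let cached = source.with_extension("gguf");
    if !cached.exists() {
        // Write under a temporary name first so an interrupted conversion
        // never leaves a truncated file at the cached path.
        let tmp = source.with_extension("gguf.tmp");
        convert(source, &tmp)?;
        std::fs::rename(&tmp, &cached)?;
    }
    Ok(cached)
}
```

The sketch does not invalidate the cache when the source file changes; a real loader would likely also compare modification times or a content hash.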

§Quick start

use rlx_gguf_convert::{Converter, Scheme};

let report = Converter::from_safetensors("model.safetensors")?
    .default_scheme(Scheme::Q4_K)
    .skip_quant_for(|name, shape| {
        // Norms, biases, and any tensor with fewer than two dimensions keep their source dtype.
        name.contains("norm") || name.contains("bias") || shape.len() < 2
    })
    .architecture("llama")
    .write_gguf("model.q4_k.gguf")?;
println!("wrote {} tensors, {:.2}× smaller",
         report.tensors,
         report.compression_ratio());

§Real-weight benchmarks

Validated end-to-end against two production checkpoints on an M2 Mac mini, release build. Mean cosine is the cosine similarity between the source values and the dequantized Converter::write_gguf output, averaged over every quantized weight tensor; non-quantized tensors round-trip exactly and aren’t included.

Model	Source size	Scheme	Output size	Shrink	Mean cosine	Wall time
Bio_ClinicalBERT	416 MB F32	Q8_0	113 MB	3.75×	0.999984	0.27s
Bio_ClinicalBERT	416 MB F32	Q6_K	86 MB	4.85×	0.999815	0.22s
Bio_ClinicalBERT	416 MB F32	Q4_K	59 MB	7.05×	0.996785	0.44s
Bio_ClinicalBERT	416 MB F32	Q4_0	59 MB	7.05×	0.996169	0.44s
Qwen3-TTS 0.6B	1.7 GB BF16	Q4_K	491 MB	3.55×	0.996712	3.7s
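The mean-cosine column could be reproduced along the following lines. This is a sketch under assumptions: the function names are hypothetical, the mean is taken unweighted across tensors (the text does not say whether larger tensors are weighted more heavily), and accumulation is done in f64 to limit rounding error on large tensors.

```rust
/// Cosine similarity between a source tensor and its dequantized round-trip.
pub fn cosine_similarity(source: &[f32], roundtrip: &[f32]) -> f64 {
    assert_eq!(source.len(), roundtrip.len(), "tensor lengths must match");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&a, &b) in source.iter().zip(roundtrip) {
        let (a, b) = (a as f64, b as f64);
        dot += a * b;
        norm_a += a * a;
        norm_b += b * b;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        // Convention chosen for this sketch: an all-zero tensor only
        // matches another all-zero tensor.
        return if norm_a == norm_b { 1.0 } else { 0.0 };
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// Unweighted mean of per-tensor cosine similarities (an assumption).
/// Returns NaN for an empty slice.
pub fn mean_cosine(pairs: &[(&[f32], &[f32])]) -> f64 {
    let total: f64 = pairs.iter().map(|(s, r)| cosine_similarity(s, r)).sum();
    total / pairs.len() as f64
}
```

A value near 1.0 means the quantized weights point in almost the same direction as the originals; cosine similarity ignores a uniform scale error, so it complements rather than replaces an absolute-error metric.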

ConvertReport::compression_ratio reports the shrink relative to source bytes, so BF16 inputs naturally compress less than F32 inputs: they’re already 2× smaller on disk.
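The shrink figures in the table can be sanity-checked from the GGML block layouts. The sketch below is not the crate's code: the function names are hypothetical, and the byte counts are the standard GGML block sizes (for example, a Q4_K block stores 256 weights in 144 bytes). It computes the upper bound reached when every tensor is quantized; the measured ratios sit slightly below it because skipped tensors stay at their source dtype.

```rust
/// Bits stored per weight, from (bytes per block, weights per block).
pub fn bits_per_weight(block_bytes: u32, block_weights: u32) -> f64 {
    (block_bytes * 8) as f64 / block_weights as f64
}

/// Upper bound on the shrink if every tensor were quantized.
/// `source_bits` is 32.0 for F32 input and 16.0 for BF16 or F16 input.
pub fn ideal_shrink(source_bits: f64, block: (u32, u32)) -> f64 {
    source_bits / bits_per_weight(block.0, block.1)
}

// Standard GGML block layouts: (bytes per block, weights per block).
pub const Q8_0: (u32, u32) = (34, 32); // 8.5 bits per weight
pub const Q6_K: (u32, u32) = (210, 256); // 6.5625 bits per weight
pub const Q4_K: (u32, u32) = (144, 256); // 4.5 bits per weight
pub const Q4_0: (u32, u32) = (18, 32); // 4.5 bits per weight
```

For instance, `ideal_shrink(32.0, Q4_K)` is about 7.11 against the measured 7.05×, and `ideal_shrink(16.0, Q4_K)` is about 3.56 against the measured 3.55×. Q4_K and Q4_0 share the same 4.5 bits per weight, which is consistent with their identical output sizes.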

§Per-tensor schemes

Three priority levels, applied in order:

1. Exact-name override: Converter::scheme_for_name.
2. Predicate override: Converter::scheme_for returning Some(scheme) to override or None to fall through.
3. Default: Converter::default_scheme.
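The lookup order above can be modelled in a few lines. This is an illustrative sketch, not the crate's implementation: `Rules`, `resolve`, and the local `Scheme` enum are stand-ins, the shape is assumed to be a `&[usize]`, and trying multiple predicates in registration order is an assumption.

```rust
use std::collections::HashMap;

/// Stand-in for the crate's `Scheme`; only the variants used here.
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Scheme {
    Q4_K,
    Q6_K,
    Q8_0,
}

/// A predicate returns Some(scheme) to override or None to fall through.
pub type Predicate = fn(&str, &[usize]) -> Option<Scheme>;

pub struct Rules {
    pub by_name: HashMap<String, Scheme>,
    pub predicates: Vec<Predicate>,
    pub default: Scheme,
}

impl Rules {
    pub fn resolve(&self, name: &str, shape: &[usize]) -> Scheme {
        // 1. An exact-name override wins outright.
        if let Some(&scheme) = self.by_name.get(name) {
            return scheme;
        }
        // 2. The first predicate returning Some(..) wins.
        if let Some(scheme) = self.predicates.iter().find_map(|p| p(name, shape)) {
            return scheme;
        }
        // 3. Otherwise use the default scheme.
        self.default
    }
}
```

With an exact-name rule for one tensor, a predicate matching names that contain "embed", and a Q4_K default, a tensor that matches both the name rule and the predicate resolves to the name rule's scheme.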

Tensors whose element count isn’t a multiple of the chosen scheme’s block size fall back to F16. Tensors matched by Converter::skip_quant_for stay at their source dtype (preserved via Scheme::F32 / Scheme::F16 / Scheme::BF16).
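The block-size fallback can be illustrated as follows. This is a sketch, not the crate's code: `block_size` and `effective_scheme` are hypothetical names, the local `Scheme` enum is a stand-in, and the block sizes are the standard GGML ones (32 weights for Q8_0 and Q4_0, 256 for the K-quants).

```rust
/// Stand-in for the crate's `Scheme`.
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Scheme {
    F32,
    F16,
    BF16,
    Q8_0,
    Q4_0,
    Q4_K,
    Q6_K,
}

impl Scheme {
    /// Weights per quantization block; 1 for the plain float types.
    pub fn block_size(self) -> usize {
        match self {
            Scheme::F32 | Scheme::F16 | Scheme::BF16 => 1,
            Scheme::Q8_0 | Scheme::Q4_0 => 32,
            Scheme::Q4_K | Scheme::Q6_K => 256,
        }
    }
}

/// Returns the requested scheme, or F16 when the element count is not a
/// multiple of the scheme's block size.
pub fn effective_scheme(requested: Scheme, shape: &[usize]) -> Scheme {
    let elements: usize = shape.iter().product();
    if elements % requested.block_size() == 0 {
        requested
    } else {
        Scheme::F16
    }
}
```

A 100×8 tensor has 800 elements: that is a multiple of 32, so Q8_0 is kept, but not of 256, so a Q4_K request falls back to F16.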

§Crate layout

- Scheme / Converter / ConvertReport are the public API.
- Source readers are gated behind Cargo features:
  - safetensors (default): .safetensors files.
  - onnx: ONNX initializer tensors via rlx-onnx-import.
- The encoder side is shared with rlx_gguf, so output round-trips through rlx_gguf::GgufFile::dequant_f32.

Structs§

ConvertReport: Conversion summary returned by Converter::write_gguf. Use it to log compression ratios, generate a per-scheme histogram, or drive a re-convert pass with different scheme rules.
Converter: Top-level conversion driver. Build with Converter::from_reader (or the from_safetensors / from_onnx convenience constructors behind their feature gates), set a default scheme and any per-tensor overrides, then call Converter::write_gguf.
NamedTensor

Enums§

GgmlType
MetaValue
Scheme: Quantization scheme to apply to a tensor when converting. Mirrors the GgmlType variants we have encoders for.

Traits§

TensorReader: Source-file reader contract. Implementations live in this module behind cargo features; downstream crates can plug in their own.