Crate rlx_gguf_convert

Convert tensors from external formats (safetensors, ONNX) into GGUF with per-tensor quantization. Designed to be called at first inference load: read the source file once, write a GGUF blob with the chosen quantization scheme, then dequantize the GGUF directly on subsequent loads. For transformer weights this cuts both disk footprint and load-time memory (in the benchmarks below, Q4_K shrinks F32 sources about 7× and BF16 sources about 3.5×).

§Quick start

use rlx_gguf_convert::{Converter, Scheme};

let report = Converter::from_safetensors("model.safetensors")?
    .default_scheme(Scheme::Q4_K)
    .skip_quant_for(|name, shape| {
        // Tiny 1-D tensors (norms, biases) stay full-precision.
        name.contains("norm") || name.contains("bias") || shape.len() < 2
    })
    .architecture("llama")
    .write_gguf("model.q4_k.gguf")?;
println!("wrote {} tensors, {:.2}× smaller",
         report.tensors,
         report.compression_ratio());

§Real-weight benchmarks

Validated end-to-end against two production checkpoints. Mean cosine is the cosine similarity between the dequantized Converter::write_gguf output and the source values, averaged over every quantized weight tensor; non-quantized tensors round-trip exactly and aren’t included. Measured on an M2 mini with a release build.

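| Model            | Source size | Scheme | Output | Shrink | Mean cosine | Wall  |
|------------------|-------------|--------|--------|--------|-------------|-------|
| Bio_ClinicalBERT | 416 MB F32  | Q8_0   | 113 MB | 3.75×  | 0.999984    | 0.27s |
| Bio_ClinicalBERT | 416 MB F32  | Q6_K   | 86 MB  | 4.85×  | 0.999815    | 0.22s |
| Bio_ClinicalBERT | 416 MB F32  | Q4_K   | 59 MB  | 7.05×  | 0.996785    | 0.44s |
| Bio_ClinicalBERT | 416 MB F32  | Q4_0   | 59 MB  | 7.05×  | 0.996169    | 0.44s |
| Qwen3-TTS 0.6B   | 1.7 GB BF16 | Q4_K   | 491 MB | 3.55×  | 0.996712    | 3.7s  |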
ConvertReport::compression_ratio reports source-byte shrink (BF16 inputs naturally compress less than F32 inputs because they’re already 2× smaller on disk).
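
To see why the source dtype matters, it helps to compute the best-case shrink from bits per weight. The sketch below is illustrative only: the `Layout` enum and both functions are hypothetical, not part of this crate, and the block sizes are taken from the ggml reference block layouts.

```rust
/// Storage layouts used in the benchmark table (hypothetical local enum).
#[allow(non_camel_case_types, dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Layout {
    F32,
    BF16,
    Q8_0,
    Q4_0,
    Q4_K,
    Q6_K,
}

/// Bits stored per weight, derived from (bytes per block, elements per block).
fn bits_per_weight(layout: Layout) -> f64 {
    let (bytes, elems) = match layout {
        Layout::F32 => (4.0, 1.0),
        Layout::BF16 => (2.0, 1.0),
        // f16 scale (2 bytes) + 32 signed 8-bit quants
        Layout::Q8_0 => (34.0, 32.0),
        // f16 scale (2 bytes) + 32 4-bit quants (16 bytes)
        Layout::Q4_0 => (18.0, 32.0),
        // two f16 scales (4) + packed sub-block scales (12) + 256 4-bit quants (128)
        Layout::Q4_K => (144.0, 256.0),
        // low 4 bits (128) + high 2 bits (64) + i8 sub-block scales (16) + f16 scale (2)
        Layout::Q6_K => (210.0, 256.0),
    };
    bytes * 8.0 / elems
}

/// Best-case shrink if every tensor were quantized and the file had no metadata.
fn ideal_shrink(source: Layout, target: Layout) -> f64 {
    bits_per_weight(source) / bits_per_weight(target)
}
```

Under these assumptions the ceilings are about 3.76 for F32 to Q8_0, 4.88 for F32 to Q6_K, 7.11 for F32 to Q4_K or Q4_0, and 3.56 for BF16 to Q4_K. The measured ratios in the table sit slightly below each ceiling, which is consistent with some tensors staying unquantized plus file metadata.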

§Per-tensor schemes

Three priority levels, applied in order:

  1. Exact-name override — Converter::scheme_for_name.
  2. Predicate override — Converter::scheme_for returning Some(scheme) to override or None to fall through.
  3. Default — Converter::default_scheme.
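
The lookup order above can be modelled with a small standalone resolver. This is a sketch of the described behaviour, not the crate's implementation: `Quant`, `Predicate`, and `Rules` are hypothetical types, the `&[usize]` shape type is an assumption, and so is the idea that several predicates are tried in registration order.

```rust
use std::collections::HashMap;

/// Stand-in for the crate's scheme enum (hypothetical, trimmed to three variants).
#[allow(non_camel_case_types, dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Quant {
    F16,
    Q4_K,
    Q6_K,
}

/// A predicate returns Some(scheme) to override, or None to fall through.
type Predicate = Box<dyn Fn(&str, &[usize]) -> Option<Quant>>;

struct Rules {
    by_name: HashMap<String, Quant>, // level 1: exact-name overrides
    predicates: Vec<Predicate>,      // level 2: predicate overrides
    default: Quant,                  // level 3: default scheme
}

impl Rules {
    /// Resolve the scheme for one tensor using the three priority levels.
    fn resolve(&self, name: &str, shape: &[usize]) -> Quant {
        // 1. An exact-name match wins outright.
        if let Some(&q) = self.by_name.get(name) {
            return q;
        }
        // 2. The first predicate that returns Some wins;
        // 3. otherwise fall back to the default.
        self.predicates
            .iter()
            .find_map(|p| p(name, shape))
            .unwrap_or(self.default)
    }
}
```

With a default of `Q4_K`, an exact-name entry for `output.weight`, and a predicate that returns `Some(Quant::F16)` for names containing `embed`, the exact-name entry is used for `output.weight`, the predicate decides for embedding tensors, and every other tensor gets the default.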

Tensors whose element count isn’t a multiple of the chosen scheme’s block size fall back to F16. Tensors matched by Converter::skip_quant_for stay at their source dtype (preserved via Scheme::F32 / Scheme::F16 / Scheme::BF16).
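
The F16 fallback is a divisibility check on the element count. The sketch below uses a hypothetical `Quant` enum and `effective_scheme` function; the block sizes (32 elements for Q4_0 and Q8_0, 256 for the K-quants) come from the ggml reference layouts rather than from this crate's source.

```rust
/// Stand-in for the crate's scheme enum (hypothetical).
#[allow(non_camel_case_types, dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Quant {
    F16,
    Q4_0,
    Q8_0,
    Q4_K,
    Q6_K,
}

impl Quant {
    /// Number of elements packed into one quantization block.
    fn block_elems(self) -> usize {
        match self {
            Quant::F16 => 1,
            Quant::Q4_0 | Quant::Q8_0 => 32,
            Quant::Q4_K | Quant::Q6_K => 256,
        }
    }
}

/// Keep the requested scheme only if the tensor's element count
/// is a whole number of blocks; otherwise fall back to F16.
fn effective_scheme(requested: Quant, shape: &[usize]) -> Quant {
    let elems: usize = shape.iter().product();
    if elems % requested.block_elems() == 0 {
        requested
    } else {
        Quant::F16
    }
}
```

For example, a 3×32 tensor has 96 elements, so under these assumptions it can be stored as Q8_0 (96 is a multiple of 32) but would fall back to F16 if Q4_K were requested (96 is not a multiple of 256).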

§Crate layout

Structs§

ConvertReport
Conversion summary returned by Converter::write_gguf. Use it to log compression ratios, generate a per-scheme histogram, or drive a re-convert pass with different scheme rules.
Converter
Top-level conversion driver. Build with Converter::from_reader (or the from_safetensors / from_onnx convenience constructors behind their feature gates), set a default scheme and any per-tensor overrides, then call Converter::write_gguf.
NamedTensor

Enums§

GgmlType
MetaValue
Scheme
Quantization scheme to apply to a tensor when converting. Mirrors the GgmlType variants we have encoders for.

Traits§

TensorReader
Source-file reader contract. Implementations live in this module behind cargo features; downstream crates can plug in their own.