GGUF (GGML Universal Format) parser, dequantizer, quantization encoder, and file writer.
Standalone: no rlx-* dependencies. Higher-level WeightLoader /
HF name mapping lives in the separate model-builders repo (see root README).
Supports GGUF v1, v2, v3 (the live formats). Tensor dtypes
decoded today: F32, F16, BF16, Q8_0, Q4_0, Q4_1, Q5_0, Q5_1,
and the full K-quant family Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_K.
The encoder side covers every dtype the decoder accepts; see
quantize and writer::GgufWriter. Tensors of any other dtype
parse fine, but dequant_f32 returns an error for them, so callers
know exactly which key is unreadable. Supporting a new dtype means
adding one match arm.
Endianness: little-endian assumed (the only flavor that ships in practice). The GGUF spec reserves a flag for big-endian; we don’t parse it.
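To make the little-endian assumption concrete, here is a minimal sketch of reading the fixed-size header fields. It is not this crate's parser: the names MAGIC, Header, parse_header, le_u32 and le_u64 are hypothetical, and it assumes the v2/v3 header layout (u32 magic, u32 version, u64 tensor count, u64 metadata key/value count). GGUF v1, which this crate also reads, stored the counts in narrower fields and is left out here.

```rust
/// ASCII "GGUF" read as a little-endian u32 (hypothetical name, not the crate's GGUF_MAGIC).
const MAGIC: u32 = u32::from_le_bytes(*b"GGUF");

/// Fixed-size header fields, assuming the v2/v3 layout where both counts are u64.
#[derive(Debug, PartialEq)]
struct Header {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Read a little-endian u32 at `at`, or None if the slice is too short.
fn le_u32(b: &[u8], at: usize) -> Option<u32> {
    let s = b.get(at..at + 4)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Read a little-endian u64 at `at`, or None if the slice is too short.
fn le_u64(b: &[u8], at: usize) -> Option<u64> {
    let s = b.get(at..at + 8)?;
    Some(u64::from_le_bytes([
        s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7],
    ]))
}

/// Parse the first 24 bytes of a file. Returns None on short input,
/// a wrong magic, or a version this sketch does not handle.
fn parse_header(bytes: &[u8]) -> Option<Header> {
    if le_u32(bytes, 0)? != MAGIC {
        return None;
    }
    let version = le_u32(bytes, 4)?;
    if version != 2 && version != 3 {
        return None;
    }
    Some(Header {
        version,
        tensor_count: le_u64(bytes, 8)?,
        metadata_kv_count: le_u64(bytes, 16)?,
    })
}
```

Every multi-byte field goes through from_le_bytes, so the sketch reads the same values on any host.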
§Reading a GGUF file
use rlx_gguf::GgufFile;
let f = GgufFile::from_path("model.gguf")?;
let (data, shape) = f.dequant_f32("token_embd.weight")?;
§Writing a GGUF file with mixed quant schemes
use rlx_gguf::{GgmlType, GgufWriter, MetaValue, quantize};
let w_floats: Vec<f32> = /* … */;
let bias_floats: Vec<f32> = /* … */;
let mut w = GgufWriter::new();
w.set_arch("llama");
w.set_meta("general.name", MetaValue::String("my-model".into()));
// Big projection → 4-bit K-quant. Tiny bias → float-16 (stays at
// native precision so we don't pay 5% accuracy for 32 numbers).
w.add_tensor_bytes("w", vec![4096, 4096], GgmlType::Q4K,
    quantize(&w_floats, GgmlType::Q4K)?)?;
w.add_tensor_bytes("b", vec![4096], GgmlType::F16,
    quantize(&bias_floats, GgmlType::F16)?)?;
w.write_to_path("out.gguf")?;
For end-to-end safetensors / ONNX → GGUF conversion with per-tensor
scheme rules, see the companion rlx-gguf-convert crate.
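The per-tensor choice in the example above (4-bit K-quant for the large projection, F16 for the small bias) can be written as a simple shape-based rule. The following is only a sketch: Scheme and pick_scheme are hypothetical names and are not part of rlx-gguf or rlx-gguf-convert, whose actual rule format is not shown here. The 256-element check reflects the K-quant super-block size described under QK_K.

```rust
/// Hypothetical stand-in for the crate's dtype enum, limited to the
/// two variants used in the writing example.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Scheme {
    Q4K,
    F16,
}

/// K-quant super-block size (256 elements).
const QK_K: usize = 256;

/// Pick a storage scheme from the tensor shape alone.
fn pick_scheme(shape: &[usize]) -> Scheme {
    let n: usize = shape.iter().product();
    // Only tensors with at least two dimensions whose element count
    // fills whole super-blocks are sent to the 4-bit K-quant.
    if shape.len() >= 2 && n % QK_K == 0 {
        Scheme::Q4K
    } else {
        // Vectors (biases, norms) and odd-sized tensors stay in F16.
        Scheme::F16
    }
}
```

With this rule, a [4096, 4096] projection maps to Q4K and a [4096] bias maps to F16, matching the example.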
Re-exports§
pub use quantize::quantize;
pub use writer::GgufWriter;
pub use writer::TensorPayload;
Modules§
- quantize
- Float → GGML quant encoders. Mirrors quantize_row_* from llama.cpp’s ggml-quants.c. Output is byte-identical for the legacy schemes (Q4_0/Q4_1/Q5_0/Q5_1/Q8_0). K-quants use a simpler per-sub-block min/max search than upstream’s iterative make_qx_quants, so quality is a notch below llama-quantize, but the output is fully valid GGUF that round-trips through the dequant kernels in super.
- writer
- GGUF v3 file writer. Serializes a sequence of metadata key/value pairs and tensors. Layout matches the live spec (see lib.rs parser): little-endian, alignment padding before the data segment, tensor data appended in declaration order.
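To illustrate what a per-sub-block min/max search means, here is a minimal sketch of 4-bit affine quantization of one sub-block. It shows the general idea only and is not the crate's encoder: the function names are hypothetical, and a real K-quant encoder additionally quantizes each sub-block's scale and min to 6 bits against the super-block's d and dmin, which this sketch skips.

```rust
/// Quantize one sub-block to 4-bit codes using its min and max.
/// Returns (scale, min, codes) such that x ≈ code * scale + min.
fn quantize_minmax_4bit(x: &[f32]) -> (f32, f32, Vec<u8>) {
    let min = x.iter().copied().fold(f32::INFINITY, f32::min);
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // 16 levels give 15 steps between min and max. A constant block
    // would give a zero scale, so fall back to 1.0 in that case.
    let scale = if max > min { (max - min) / 15.0 } else { 1.0 };
    let codes = x
        .iter()
        .map(|&v| ((v - min) / scale).round().max(0.0).min(15.0) as u8)
        .collect();
    (scale, min, codes)
}

/// Inverse mapping: rebuild approximate floats from the codes.
fn dequantize_minmax_4bit(scale: f32, min: f32, codes: &[u8]) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale + min).collect()
}
```

Because each value is rounded to the nearest of 16 evenly spaced levels, the reconstruction error per element is at most half a step (scale / 2). An iterative search can lower the error further by trying alternative scales.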
Structs§
Enums§
Constants§
- DEFAULT_ALIGNMENT
- GGUF_MAGIC
- K_SCALE_SIZE
- Byte size of the packed scales+mins region in block_q4_K/block_q5_K: 8 sub-blocks × 12 bits (6 bits scale + 6 bits min) = 96 bits = 12 bytes. Same layout in both formats.
- QK4_0
- Legacy Q4_0 block size (32 elements).
- QK8_0
- Legacy Q8_0 block size (32 elements).
- QK_K
- Super-block size (256 elements) shared by every K-quant format, per llama.cpp’s ggml-quants.h. Tensors quantized with Q{4,5,6,8}_K must have an element count divisible by 256.
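The 12-byte figure for K_SCALE_SIZE follows from packing eight 6-bit scales and eight 6-bit mins. The sketch below shows one way to pack and unpack them, following the bit layout used by llama.cpp's Q4_K code (get_scale_min_k4). The function names pack_scales_mins and unpack_scale_min are hypothetical, and this is not presented as the crate's own implementation.

```rust
/// Packed scales+mins region for one Q4_K / Q5_K super-block:
/// 8 sub-blocks × (6 + 6) bits = 96 bits = 12 bytes.
const K_SCALE_SIZE: usize = 8 * (6 + 6) / 8;

/// Pack eight 6-bit scales and eight 6-bit mins into 12 bytes.
fn pack_scales_mins(scales: &[u8; 8], mins: &[u8; 8]) -> [u8; K_SCALE_SIZE] {
    let mut out = [0u8; K_SCALE_SIZE];
    for j in 0..8 {
        let (s, m) = (scales[j] & 63, mins[j] & 63);
        if j < 4 {
            // Sub-blocks 0..3: all 6 bits in the low bits of bytes 0..3 / 4..7.
            out[j] |= s;
            out[j + 4] |= m;
        } else {
            // Sub-blocks 4..7: low nibbles share one of bytes 8..11,
            // high 2 bits go into the top bits of bytes 0..3 / 4..7.
            out[j + 4] = (s & 0x0F) | ((m & 0x0F) << 4);
            out[j - 4] |= (s >> 4) << 6;
            out[j] |= (m >> 4) << 6;
        }
    }
    out
}

/// Recover the (scale, min) pair for sub-block `j` (0..8).
fn unpack_scale_min(j: usize, q: &[u8; K_SCALE_SIZE]) -> (u8, u8) {
    if j < 4 {
        (q[j] & 63, q[j + 4] & 63)
    } else {
        let s = (q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4);
        let m = (q[j + 4] >> 4) | ((q[j] >> 6) << 4);
        (s, m)
    }
}
```

Packing followed by unpacking returns the original 6-bit values for all eight sub-blocks, which makes it a convenient round-trip check when writing an encoder.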
Functions§
- bytes_for_public
- Bytes a tensor of n elements occupies in storage for dtype. Returns None if n is not a multiple of the scheme’s block size.
- dequant_q2_k
- dequant_q2_k_block
- Dequantize one Q2_K super-block (84 bytes) into out.
- dequant_q3_k
- dequant_q3_k_block
- Dequantize one Q3_K super-block (110 bytes) into out.
- dequant_q4_0
- Full-tensor Q4_0 dequant (element count must be a multiple of QK4_0).
- dequant_q4_0_block
- Dequant one Q4_0 block (2 + QK4_0/2 bytes → QK4_0 f32 values).
- dequant_q4_k
- Q4_K block: 144 bytes / 256 elements (4.5 bits/element). Layout: f16 d + f16 dmin + 12-byte packed scales/mins + 128 bytes of packed nibbles.
- dequant_q4_k_block
- Dequantize one Q4_K super-block (144 bytes) into out (256 f32s).
- dequant_q5_k
- Q5_K block: 176 bytes / 256 elements (5.5 bits/element). Layout: f16 d + f16 dmin + 12-byte packed scales/mins + 32-byte high-bits + 128 bytes of packed nibbles. Each element’s 5th bit lives in qh, indexed by position within the super-block.
- dequant_q5_k_block
- Dequantize one Q5_K super-block (176 bytes) into out.
- dequant_q6_k
- Q6_K block: 210 bytes / 256 elements (6.5625 bits/element). The highest-quality K-quant; common in *-Q6_K.gguf model dumps. Layout: 128 low-nibble bytes + 64 high-2-bit bytes + 16 i8 scales + f16 d.
- dequant_q6_k_block
- Dequantize one Q6_K super-block (210 bytes) into out.
- dequant_q8_0
- Full-tensor Q8_0 dequant (element count must be a multiple of QK8_0).
- dequant_q8_0_block
- Dequant one Q8_0 block (2 + QK8_0 bytes → QK8_0 f32 values).
- dequant_q8_k
- Q8_K block: 276 bytes / 256 elements. Mostly an intermediate format used inside llama.cpp’s matmul kernels, but some dumps do store it directly. We only need to multiply the i8 quants by the f32 super-block scale; bsums (partial sums over 16-element groups) is metadata we can safely ignore for plain dequant.
- dequant_q8_k_block
- Dequantize one Q8_K super-block (276 bytes) into out.
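The two legacy block layouts in the list above are small enough to sketch in full. The code below assumes the llama.cpp layout for both: a little-endian f16 scale d followed by the quants, with Q4_0 storing element j in the low nibble and element j + 16 in the high nibble of byte j, offset by 8. The function names carry a _sketch suffix to mark them as hypothetical and distinct from the crate's dequant_q4_0_block and dequant_q8_0_block, whose exact signatures are not shown here. The f16 conversion is hand-written because stable std has no f16 type.

```rust
const QK4_0: usize = 32;
const QK8_0: usize = 32;

/// Convert IEEE 754 half-precision bits to f32.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) & 1) as u32;
    let exp = ((bits >> 10) & 0x1F) as u32;
    let frac = (bits & 0x3FF) as u32;
    let out = match (exp, frac) {
        (0, 0) => sign << 31, // signed zero
        (0, _) => {
            // Subnormal: value = frac × 2^-24.
            let v = frac as f32 * (1.0 / 16_777_216.0);
            return if sign == 1 { -v } else { v };
        }
        (0x1F, _) => (sign << 31) | 0x7F80_0000 | (frac << 13), // inf / NaN
        // Normal: rebias the exponent from 15 to 127.
        _ => (sign << 31) | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(out)
}

/// One Q4_0 block: f16 d + 16 bytes of packed nibbles → 32 f32 values.
fn dequant_q4_0_block_sketch(block: &[u8; 2 + QK4_0 / 2], out: &mut [f32; QK4_0]) {
    let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
    for j in 0..QK4_0 / 2 {
        let byte = block[2 + j];
        // Low nibble is element j, high nibble is element j + 16.
        out[j] = ((byte & 0x0F) as i32 - 8) as f32 * d;
        out[j + QK4_0 / 2] = ((byte >> 4) as i32 - 8) as f32 * d;
    }
}

/// One Q8_0 block: f16 d + 32 signed bytes → 32 f32 values.
fn dequant_q8_0_block_sketch(block: &[u8; 2 + QK8_0], out: &mut [f32; QK8_0]) {
    let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
    for j in 0..QK8_0 {
        out[j] = (block[2 + j] as i8) as f32 * d;
    }
}
```

A full-tensor dequant then amounts to walking the byte buffer in 18-byte (Q4_0) or 34-byte (Q8_0) steps and calling the block routine on each chunk, which is why the element count must be a multiple of the block size.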